Although prior information about \(\mathbf{p} = (\mathbf{p_0},\mathbf{p_1},\dots,\mathbf{p_K})\) is usually unavailable, in some cases pre-sample information exists in the form of a distribution \(\mathbf{q} = (\mathbf{q_0},\mathbf{q_1},\dots,\mathbf{q_K})\). This distribution can be used as an initial hypothesis and incorporated into the consistency relations of the maximum entropy formalism. Kullback and Leibler [1] defined the cross-entropy (CE) between \(\mathbf{p}\) and \(\mathbf{q}\) as
\[\begin{align} I(\mathbf{p},\mathbf{q})=\sum_{k=0}^K \mathbf{p}_k' \ln \left(\mathbf{p}_k/\mathbf{q}_k\right). \end{align}\]
\(I(\mathbf{p},\mathbf{q})\) measures the discrepancy between the \(\mathbf{p}\) and \(\mathbf{q}\) distributions. It is non-negative and equals zero when \(\mathbf{p}=\mathbf{q}\). According to the principle of minimum cross-entropy [2,3], one should choose the probabilities that are as close as possible to the prior probabilities while remaining consistent with the data.
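As a quick numerical illustration, the cross-entropy between a distribution concentrated on its central point and a uniform prior can be computed directly in base R (a minimal sketch; the vectors below are arbitrary examples, not package objects):

# cross-entropy I(p, q) = sum(p * log(p / q))
ce <- function(p, q) sum(p * log(p / q))
q.unif <- rep(1/5, 5)                    # uniform prior
p.spiked <- c(0.1, 0.1, 0.6, 0.1, 0.1)   # mass concentrated on the central point
ce(p.spiked, q.unif)
#> [1] 0.3819085
ce(q.unif, q.unif)                       # zero when p = q
#> [1] 0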
Given the previous, and for the reparameterized linear regression
model, \[\begin{equation}
\mathbf{y}=\mathbf{XZp} + \mathbf{Vw},
\end{equation}\]
the Generalized Cross Entropy (GCE) estimator is given by
\[\begin{equation}
\hat{\boldsymbol{\beta}}^{GCE}(\mathbf{Z},\mathbf{V}) =
\underset{\mathbf{p},\mathbf{w}}{\operatorname{argmin}}
\left\{\mathbf{p}' \ln \left(\mathbf{p/q}\right) +
\mathbf{w}' \ln \left(\mathbf{w/u}\right) \right\},
\end{equation}\]
subject to the same model constraints as the GME estimator (see “Generalized Maximum Entropy
framework”).
Using set notation, the minimization problem can be rewritten as follows: \[\begin{align} &\text{minimize} & I(\mathbf{p},\mathbf{q},\mathbf{w},\mathbf{u}) &=\sum_{k=0}^{K}\sum_{m=1}^{M} p_{km}\ln(p_{km}/q_{km}) +\sum_{n=1}^{N}\sum_{j=1}^{J} w_{nj}\ln(w_{nj}/u_{nj}) \\ &\text{subject to} & y_n &= \sum_{k=0}^{K}\sum_{m=1}^{M} x_{nk}z_{km}p_{km} + \sum_{j=1}^{J} v_{nj}w_{nj}, \forall n \\ & & \sum_{m=1}^{M} p_{km} &= 1, \forall k\\ & & \sum_{j=1}^{J} w_{nj} &= 1, \forall n. \end{align}\]
The Lagrangian equation \[\begin{equation}
\mathcal{L}=\mathbf{p}' \ln \left(\mathbf{p/q}\right) +
\mathbf{w}' \ln \left(\mathbf{w/u}\right) +
\boldsymbol{\lambda}' \left( \mathbf{y} - \mathbf{XZp} -
\mathbf{Vw} \right) + \boldsymbol{\theta}'\left(
\mathbf{1}_{K+1}-(\mathbf{I}_{K+1} \otimes \mathbf{1}'_M)\mathbf{p}
\right) + \boldsymbol{\tau}'\left( \mathbf{1}_N-(\mathbf{I}_N
\otimes \mathbf{1}'_J)\mathbf{w}\right)
\end{equation}\]
can be used to find the interior solution, where \(\boldsymbol{\lambda}\), \(\boldsymbol{\theta}\), and \(\boldsymbol{\tau}\) are the associated \((N\times 1)\), \(((K+1)\times 1)\), and \((N\times 1)\) vectors of Lagrange multipliers, respectively.
Taking the gradient of the Lagrangian and solving the first-order
conditions yields the solutions for \(\mathbf{\hat p}\) and \(\mathbf{\hat w}\)
\[\begin{equation} \hat p_{km} = \frac{q_{km}\exp\left(-z_{km}\sum_{n=1}^N \hat\lambda_n x_{nk}\right)}{\sum_{m'=1}^M q_{km'}\exp\left(-z_{km'}\sum_{n=1}^N \hat\lambda_n x_{nk}\right)} \end{equation}\] and \[\begin{equation} \hat w_{nj} = \frac{u_{nj}\exp(-\hat\lambda_n v_{nj})}{\sum_{j'=1}^J u_{nj'}\exp(-\hat\lambda_n v_{nj'})}. \end{equation}\]
Note that when the prior distributions \(\mathbf{q}\) and \(\mathbf{u}\) are uniform, maximum entropy and minimum cross-entropy produce the same estimates.
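In practice, this final step is a prior-weighted softmax over the support points: with a uniform \(\mathbf{q_k}\) the prior factors cancel and the GME solution is recovered. A minimal base-R sketch of that computation (all inputs below are hypothetical values, not quantities returned by GCEstim):

# p.hat_km is proportional to q_km * exp(-z_km * sum_n(lambda.hat_n * x_nk))
set.seed(1)
lambda.hat <- rnorm(10, sd = 0.01)      # hypothetical Lagrange multipliers (N = 10)
x.k <- rnorm(10)                        # hypothetical k-th column of X
z.k <- seq(-100, 100, length.out = 5)   # support points for coefficient k
q.k <- c(0.1, 0.1, 0.6, 0.1, 0.1)       # prior probabilities for coefficient k
num <- q.k * exp(-z.k * sum(lambda.hat * x.k))
p.hat.k <- num / sum(num)               # normalized probabilities, summing to one
sum(z.k * p.hat.k)                      # implied estimate, beta.hat_k = z_k' p.hat_k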
Consider `dataGCE` (see “Generalized Maximum Entropy framework”).
Again under a “no a priori information” scenario for the parameters, one can assume that \(z_k^{upper}=100\), \(k\in\left\lbrace 0,\dots,5\right\rbrace\), is a “wide upper bound” for the signal support space. Using `lmgce`, a model can be fitted under either the GME or the GCE framework. If `support.signal.points` is an integer, a constant vector, or a constant matrix, a uniform distribution is assumed for \(\mathbf{q}\), which corresponds to the GME framework.
res.lmgce.100.GME <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = 5,
    twosteps.n = 0,
    seed = 230676
  )
The estimated GME coefficients are \(\widehat{\boldsymbol{\beta}}^{GME_{(100)}}=\) (1.026, -0.155, 1.822, 3.319, 8.393, 11.467).
(coef.res.lmgce.100.GME <- coef(res.lmgce.100.GME))
#> (Intercept) X001 X002 X003 X004 X005
#> 1.0255630 -0.1552375 1.8221235 3.3194530 8.3932055 11.4670530
But if there is some information, for instance on \(\beta_1\) and \(\beta_2\), it can be reflected in `support.signal.points`. Suppose one suspects that \(\beta_1=\beta_2=0\). Since the support spaces are centered at zero, one can assign a higher probability to the support point at or around the center, for instance \(\mathbf{q_1}=\mathbf{q_2}=(0.1, 0.1, 0.6, 0.1, 0.1)'\). `support.signal.points` accepts information on the distribution of probabilities in the form of a \((K+1)\times M\) matrix, whose first row corresponds to \(\mathbf{q_0}\), the second to \(\mathbf{q_1}\), and so on.
(support.signal.points.matrix <-
   matrix(
     c(rep(1/5, 5),
       c(0.1, 0.1, 0.6, 0.1, 0.1),
       c(0.1, 0.1, 0.6, 0.1, 0.1),
       rep(1/5, 5),
       rep(1/5, 5),
       rep(1/5, 5)),
     ncol = 5,
     byrow = TRUE))
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.2 0.2 0.2 0.2 0.2
#> [2,] 0.1 0.1 0.6 0.1 0.1
#> [3,] 0.1 0.1 0.6 0.1 0.1
#> [4,] 0.2 0.2 0.2 0.2 0.2
#> [5,] 0.2 0.2 0.2 0.2 0.2
#> [6,] 0.2 0.2 0.2 0.2 0.2
res.lmgce.100.GCE <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = support.signal.points.matrix,
    twosteps.n = 0,
    seed = 230676
  )
The estimated GCE coefficients are \(\widehat{\boldsymbol{\beta}}^{GCE_{(100)}}=\) (1.026, -0.143, 1.655, 3.228, 8.189, 11.269).
(coef.res.lmgce.100.GCE <- coef(res.lmgce.100.GCE))
#> (Intercept) X001 X002 X003 X004 X005
#> 1.026345 -0.143421 1.654828 3.227839 8.189040 11.269391
The prediction errors are approximately equal (\(RMSE_{\mathbf{\hat y}}^{GME_{(100)}} \approx\) 0.407 and \(RMSE_{\mathbf{\hat y}}^{GCE_{(100)}} \approx\) 0.407), as are the prediction cross-validation errors (\(CV\text{-}RMSE_{\mathbf{\hat y}}^{GME_{(100)}} \approx\) 0.428 and \(CV\text{-}RMSE_{\mathbf{\hat y}}^{GCE_{(100)}} \approx\) 0.427).
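These in-sample prediction errors can be reproduced with `accmeasure` by comparing fitted and observed values (a sketch assuming that `fitted()` extracts the fitted values of an `lmgce` object; the cross-validation errors are computed internally by `lmgce` when `cv = TRUE`):

GCEstim::accmeasure(fitted(res.lmgce.100.GME), dataGCE$y, which = "RMSE")
GCEstim::accmeasure(fitted(res.lmgce.100.GCE), dataGCE$y, which = "RMSE")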
The precision error is lower for the GCE approach: \(RMSE_{\boldsymbol{\hat\beta}}^{GME_{(100)}} \approx\) 1.595 and \(RMSE_{\boldsymbol{\hat\beta}}^{GCE_{(100)}} \approx\) 1.458.
(RMSE_beta.lmgce.100.GME <-
GCEstim::accmeasure(coef.res.lmgce.100.GME, coef.dataGCE, which = "RMSE"))
#> [1] 1.594821
(RMSE_beta.lmgce.100.GCE <-
GCEstim::accmeasure(coef.res.lmgce.100.GCE, coef.dataGCE, which = "RMSE"))
#> [1] 1.457947
If there were some information on the distribution of \(\mathbf{w}\), a similar analysis could be done for \(\mathbf{u}\) through `support.noise.points`, as sketched below.
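For illustration, and assuming the same row-wise convention as for the signal priors, a hypothetical \(N\times J\) prior for \(\mathbf{w}\) with \(J=3\) noise support points that favours the central (zero) point could be built as follows (a sketch only; the expected orientation and number of support points should be checked in the `lmgce` documentation):

# hypothetical N x J prior on the noise weights (J = 3), each row summing to one
support.noise.points.matrix <-
  matrix(c(0.25, 0.5, 0.25),
         nrow = nrow(dataGCE),
         ncol = 3,
         byrow = TRUE)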
In summary, the minimum cross-entropy formalism allows prior information to be incorporated as probability weights, which can improve the precision of the estimates.
This work was supported by Fundação para a Ciência e Tecnologia (FCT) through CIDMA and projects https://doi.org/10.54499/UIDB/04106/2020 and https://doi.org/10.54499/UIDP/04106/2020.