robRatio
The functions contained in this package are originally prepared for ratio imputation for official statistics. The ratio model is one of the popular tools for imputation, as the model can accomodate heteroscedastic datasets without any data transformation. On the other hand, the heteroscedastic error term of the ratio model prevents robustification by means of M-estimation.
The functions for ratio models obtain robustness by extracting the homoscedastic part from the heteroscedastic residuals. Furthermore, the conventional ratio model is generalized by parameterizing the relationship between the error term and the explanatory variable. For details, refer to the next section.
The conventional ratio model is \(y_i = \beta x_i + \epsilon_i, \; (i=1, \ldots, n)\), where \(x_i\) is an exlanatory variable and \(y_i\) is a dependent variable. The error term \(\epsilon_i\) is heteroscedastic.
For robustification, the heteroscedastic error term \(\epsilon \sim N(0, x_i\sigma^2)\) was reformulated with the homoscedastic error \(\varepsilon \sim N(0, \sigma^2)\) as in the linear regression model.
Since these error terms have the relation, \(\varepsilon_i = \epsilon_i / \sqrt(x_i)\) (Cochran, 1977), the ratio model can be expressed as \(y_i = \beta x_i + \varepsilon_i \sqrt(x_i)\). This re-formulation yields homoscedastic quasi-residuals of the ratio model, which is obtained by dividing the residuals by the square root of \(x_i\).
By replacing the error term \(\varepsilon_i \sqrt(x_i)\) with \(\varepsilon_i x_i^{\gamma}\), the ratio model was generalized. The new parameter \(\gamma\) controls the relationship between the error term and the explanatory variables. The generalized ratio model and its estimation equation is as follows:
\[ y_i = \beta x_i + x_i^\gamma \varepsilon_i. \]
\[ \hat{\beta} = \frac{\sum^{n}_{i=1} y_i x_i^{1-2 \gamma}}{\sum^n_{i=1} x_i^{2(1-\gamma)}}, \]
The quasi-residual \(\check{r}_i\) is,
\[ \check{r}_i = \frac{y_i - \hat{\beta} x_i}{x_i^\gamma}. \]
The robustified estimator for the generalized ratio model is derived by means of M-estimation as follows:
\[ \hat{\beta}_{rob} = \frac{\sum{w_i y_i x_i^{1-2\gamma}}}{\sum{w_i x_i^{2 (1-\gamma)}}}, \]
where \(w_i\) is obtained using a weight function for quasi residuals \(\check{r}\). The role of the weight function is to alleviate influence of the observations with large residuals. The following two are the most popular functions among them. One is Tukey’s biweight function,
\[ w_i = w \left(\frac{\check{r}_i}{\hat{\sigma}} \right) = w(e_i) = \left\{ \begin{array}{ll} \left[1-\left( e_i / c\right)^2\right]^2 & |e_i| \le c \\ 0 & |e_i| > c, \end{array} \right. \\ \]
and the other is Huber’s weight function,
\[ w_i = w \left(\frac{\check{r}_i}{\hat{\sigma}} \right) = w(e_i) = \left\{ \begin{array}{ll} 1 & |e_i| \le k \\ k/|e_i| & |e_i| > k. \end{array} \right. \\ \]
The standardised residuals \(e_i\) are quasi-residuals \(\check{r}_i\) divided by an estimated scale parameter \(\hat{\sigma}\). Average absolute deviation (AAD) or median absolute deviation (MAD) is used as the scale parameter, and the tuning constants are shown in the section “About scales of residuals” below. Quasi-residuals \(\check{r}_i\) based on the homoscedastic quasi-error term \(\varepsilon_i\) are obtained by
\[ \check{r}_i = \frac{y_i - \hat{\beta}_{rob} x_i}{x_i^\gamma}. \]
The iterative re-weighted least squares (IRLS) algorithm is used for estimation regarding both models (i.e. regression model and ratio models). See Bienias et al.(1997) and other paper in the reference section.
This package contains two parent functions and an experimental
function without any child functions.
The parent functions are robRatio()
for ratio models and
robReg()
for multivarilate linear regression models.
Both robRatio()
and robreg()
have two child
functions. One uses Tukey’s biweight function and the other, Huber’s
weight function to calculating robust weight.
The experimental one is robGR()
. It estimates two
parameters of the generalized ratio model simultaneously.
The following table shows the relationship of each function with their models, weight functions, and scales of residuals.
Model | Parent function | Child function | Weight function | Scale for residuals | Gamma value |
---|---|---|---|---|---|
Generalized Ratio | robRatio | RrT.aad | Tukey’s biweight | AAD | Specified by user |
Generalized Ratio | robRatio | RrT.mad | Tukey’s biweight | MAD | Specified by user |
Generalized Ratio | robRatio | RrH.aad | Huber weight | AAD | Specified by user |
Generalized Ratio | robRatio | RrH.mad | Huber weight | MAD | Specified by user |
Regression | robReg | Tirls.aad | Tukey’s biweight | AAD | Not applicable |
Regression | robReg | Tirls.mad | Tukey’s biweight | MAD | Not applicable |
Regression | robReg | Hirls.aad | Huber weight | AAD | Not applicable |
Regression | robReg | Hirls.mad | Huber weight | MAD | Not applicable |
Generalized Ratio | robGR | - | Tukey’s biweight | AAD | Estimate simultaneously |
The parent functions, robRatio()
and
robReg()
has a tuning parameter tp
. It accepts
only 4, 6, or 8; however, actual figures of tuning costant differs
depending on the weight function and scale of residuals.
The following table shows the relation of the tuning parameter
tp
of robRatio()
, robReg()
, and
their child functions’ c1
. The parameter tp
of
the parent functions is standardized based on Tukey’s biweight function
with AAD scale so that users are not bothered by differences in weight
functions and scales. If users would like to try tuning constants other
than the given values (tp=4, 6, or 8), use the child functions
(RrT()
, RrH()
, Tirls()
, and
Hirls()
) directly and set the actual value of the tuning
constant for c1
. Please note that the return value of R’s
mad()
is standardized according to the SD(standard
deviation), so choose the value for SD scale for c1
when
using the MAD scale.
weight | scale | tp=4 | tp=6 | tp=8 |
---|---|---|---|---|
Tukey | SD | c1=5.01 | c1=7.52 | c1=10.03 |
Tukey | AAD | c1=4 | c1=6 | c1=8 |
Tukey | MAD | c1=3.38 | c1=5.07 | c1=6.76 |
Huber | SD | c1=1.44 | c1=2.16 | c1=2.88 |
Huber | AAD | c1=1.15 | c1=1.72 | c1=2.30 |
Huber | MAD | c1=0.97 | c1=1.46 | c1=1.94 |
The IRLS algorithm is based on Bienias et al. (1997). The tuning parameter ‘tp’ should be set between tp=4(more robust) and tp=8(less robust). For the child functions’ ‘c1’, values within the range shown in the table above are expected but not limited.
Bienias, J. L., Lassman, D. M. Scheleur, S. A. and Hogan H. (1997) Improving Outlier Detection in Two Establishment Surveys. Statistical Data Editing 2 - Methods and Techniques. (UNSC and UNECE eds.), pp.76-83. URL: https://digitallibrary.un.org/record/240780?v=pdf
Cochran, W. G. (1977) Sampling Techniques, 3rd ed., John Wiley & Sons.
Wada, K (2012). Detection of Multivariate Outliers — Regression Imputation by the Iteratively Reweighted Least Squares | (in Japanese). Research Memoir of Official Statistics,Vol.69, pp.23–52. URL https://www.stat.go.jp/training/2kenkyu/ihou/69/pdf/2-2-692.pdf.
Wada, K, Noro, T (2019). Consideration on the influence of weight functions and the scale for robust regression estimator (in Japanese).” Research Memoir of Official Statistics, Vol.76, pp.101–114. URL https://www.stat.go.jp/training/2kenkyu/ihou/76/pdf/2-2-767.pdf.
Wada,K, Sakashita, K. and Tsubaki, H. (2021). Robust Estimation for a Generalised Ratio Model. Austrian Journal of Statistics, Vol.50, pp.74-–87. URL https://doi.org/10.17713/ajs.v50i1.994