Abstract
The R package emdi facilitates the estimation of regionally disaggregated indicators using small area estimation methods and provides tools for model building, diagnostics, presenting, and exporting the results. The package version 1.1.7 includes unit-level small area models that rely on access to micro data. The area-level model by Fay and Herriot (1979) and various extensions have been added to the package since the release of version 2.0.0. These extensions include (a) area-level models with back-transformations, (b) spatial and robust extensions, (c) adjusted variance estimation methods, and (d) area-level models that account for measurement errors. Corresponding mean squared error estimators are implemented for assessing the uncertainty. User-friendly tools like a stepwise variable selection, model diagnostics, benchmarking options, high quality maps and results exportation options enable a complete analysis procedure. The functionality of the package is illustrated by examples based on synthetic data for Austrian districts.Small area estimation (SAE) enables better insight at smaller scales, for which it has gained importance in both academic and applied research. Among others, SAE is used for estimating socio-economic measures like income, poverty and health or indicators for agriculture (Datta, Fay, and Ghosh 1991; Tzavidis et al. 2012; Zhang et al. 2015; Pratesi 2016). Economic or political decision-makers and official statistics practitioners especially benefit from reliable estimation of disaggregated indicators and thus SAE methods. Existing surveys were often not planned for analysis at disaggregated levels and show only small sample sizes, which leads to a low precision of the estimates. SAE methods can be employed to avoid expensive and time-consuming enlargements of the sample size of surveys. The idea is to combine data sources with model-based approaches. Existing survey data will be enriched by auxiliary information, e.g., from census or register data, to improve the accuracy of the indicator estimation on an area- or domain- level. The terms area and domain are used interchangeably and refer either to a geographic area or to any subpopulation of a population of interest, like socio-demographic groups. Among others, Pfeffermann (2013), Rao and Molina (2015), Tzavidis et al. (2018) and Jiang and Rao (2020) give comprehensive overviews of SAE methods.
The main goal of the package emdi is the simplification of estimating these regionally disaggregated indicators. The package version 1.1.7 contains direct estimation based exclusively on survey data and model-based estimation using the unit-level empirical best predictor (EBP) method (Molina and Rao 2010). The EBP approach is powerful since it enables the simultaneous estimation of various indicators. For this, it relies on unit-level information, i.e. information about each unit in each domain. Though survey data often provides unit-level information, access to census or register data at unit-level is less likely. Hence, area-level models provide a valuable alternative, with the following benefits: First, only area-level aggregates are needed to estimate the regional indicators. Second, area-level models can consider the survey design by integrating the sampling weights. Third, the computation is faster compared to the computational intensive EBP approach.
Various R packages that employ different area-level models are
available on the Comprehensive R Archive Network (CRAN). The package smallarea
(Nandy, 2015) offers several variance estimation methods for the
standard Fay-Herriot (FH) model: maximum likelihood (ML), residual
maximum likelihood (REML), and both Prasad-Rao and Fay-Herriot
method-of-moment. Estimation of unknown sampling variances is also
offered." The ability to estimate unit- and area-level models under
heteroscedasticity is implemented by the JoSAE package (Breidenbach 2018). Robust estimation of
area-level models with spatial and/or temporal structures in the random
effects is supported by package saeRobust (Sebastian Warnholz 2022). The mcmcsae package
(Boonstra 2021) also takes spatial and
temporal correlation of the random effects into account, but fits unit-
and area-level models via Markov Chain Monte Carlo simulation.
Estimation of univariate and multivariate FH models is possible with
package msae (Permatasari and Ubaidillah 2022). The package
hbsae (Boonstra 2022) allows for the fitting of unit-
and area-level models by frequentist or hierarchical Bayesian
approaches. The possibility of estimating FH models and some of its
extensions in a Bayesian framework is also given by the BayesSAE package
(Developer 2018). The tipsae package
(De Nicolò and Gardini 2022) provides
estimation and mapping tools within a Bayesian setting for proportions
that are defined on the unit interval. The mme package (Lopez-Vizcaino, Lombardia, and Morales 2019)
implements Gaussian area-level multinomial mixed-effects models in the
SAE context. The saeME package (Mubarak and Ubaidillah 2022) can fit an
area-level model when the auxiliary variables are measured with error.
The NSAE package
(Hukum Chandra et al. 2022) can fit
stationary and nonstationary FH models. One of the commonly used
packages is the sae
package (Molina and Marhuenda 2015). It
includes a wide range of area-level models (the standard FH model with
REML, ML and FH method-of-moment model fitting and a spatial and a
spatio-temporal extension of the FH model) and unit-level models (the
nested error linear regression model of Battese,
Harter, and Fuller (1988) and the EBP approach). Table \[tab:packages\] gives an overview of
selected packages and the implemented methodology.
Besides packages that include particular area-level models, the packages
saeMSPE (Xiao et al. 2022) and SAEval (Fasulo 2022) offer different analytical- and
resampling-based MSE estimators and tools for diagnostics and graphical
evaluation of SAE models, respectively.
The latest version of package emdi 2.1.3 combines a wide range of SAE models with several tools that enable a complete analysis, and therefore adds to the space of useful packages, for the following reasons:
None of the existing packages contains such a variety of different area-level models.
In addition to models that are already available in existing R packages, emdi includes: adjusted variance estimation methods and transformation options for the standard FH model. Adjusted variance estimation methods are of particular importance when working in a non-Bayesian framework. In a Bayesian context, the variance will always be estimated as strictly positive, so packages providing a Bayesian approach do not need adjusted variance estimation methods.
Package emdi offers user-friendly tools that go beyond model estimation: diagnostic tools with both summary and graphical results, benchmarking options, high-quality geographical visualization of results, and export of results to Excel and OpenDocument Spreadsheet formats.
Plus a stepwise variable selection algorithm for area-level models is included in emdi to allow the user to build a model based on information criteria.
Thus, since package version 2.0.0, version 1.1.7 has been extended by various area-level models, but stays in line with the user-friendly orientation of the existing version.
The structure of the paper can be described as follows. The section introduces the statistical methods implemented in the package. The example data sets included in the package are presented in the section . The section provides an illustrative description of the functions using the example data sets. The section guides the reader from model-building to diagnostics of a standard FH model and creating maps of the results. The section follows with short descriptions of how to build the different extended area-level models. Finally, the section summarizes our contributions and gives an outlook.
@X @z @z @z @z @z @z @z @z @z &
Area-level model &
70 smallarea
&
70 JoSAE
&
70 sae
&
70 saeRobust
&
70 msae
&
70 hbsae
&
70 BayesSAE
&
70 saeME
&
70 emdi
Standard variance estimation & \(\surd\) & & & & \(\surd\) & \(\surd\) & & & \(\surd\)
Adjusted variance estimation & & & & & &
& & & \(\surd\)
Unknown sampling variances & \(\surd\) & & & & & &
& &
Heteroscedasticity & & \(\surd\) & & & & & &
&
Spatial correlation & & & \(\surd\) & & & & & &
\(\surd\)
Spatio-temporal correlation & & & \(\surd\) & & & & &
&
Robust & & & & \(\surd\) & & & & & \(\surd\)
Robust, spatial correlation & & & & \(\surd\) & & & & & \(\surd\)
Robust, (spatio-)temporal correlation & & & & \(\surd\) & & & & &
Multivariate & & & & & \(\surd\) & & & &
Bayesian formulation & & & & & & \(\surd\) & \(\surd\) & &
Measurement error & & & & & & & &
\(\surd\) & \(\surd\)
Transformation (log, arcsin) & & & & & &
& & & \(\surd\)
Area-level models for the estimation of indicators like means, totals or shares have been added to the package since the release of version 2.0.0. These comprise the area-level model by Fay and Herriot (1979) and several extensions of this standard model which account for issues that may come up in real data applications. To measure the precision of those models, MSE estimators have been integrated following the literature.
Throughout the paper, a finite population \(U\) is assumed that consists of \(N\) units that are subdivided into \(D\) domains or areas of specific sizes \(N_1,...,N_D\). Then a random sample of size \(n\) can be drawn from \(U\) and partitioned into \(D\) areas with \(n_1, ..., n_D\) observations per domain.
The FH model links area-level direct estimators based on survey data
to covariates aggregated on an area level that may stem from
administrative data (e.g. register or census) or alternative data
sources (e.g. satellite, social media or mobile phone data). The FH
model is composed of two levels. The first level is the sampling model
\[\hat{\theta}_{i}^{\text{Dir}}=\theta_{i} +
e_{i}, \quad i=1,\ldots,D.\] \(\hat{\theta}_{i}^{\text{Dir}}\) is an
unbiased direct estimator for a population indicator of interest \(\theta_{i}\), for instance, a mean or a
ratio. \(e_{i}\) represents independent
and normally distributed sampling errors with \(e_{i} \stackrel{ind}{\sim}
N\left(0,\sigma_{e_{i}}^2\right)\). Though the model assumes
known sampling variances, in practical applications \(\sigma_{e_{i}}^2\) are usually unknown, and
have to be estimated from the unit-level sample data (Rivest and Vandal 2003; Wang and Fuller 2003; You and
Chapman 2006). Package emdi provides a non-parametric bootstrap
for estimating the variances of the direct estimator (Alfons and Templ 2013). To allow for complex
survey designs, sampling weights (\(w\)) can be considered in the direct
estimation (Horvitz and Thompson 1952).
For example, an estimator for the population mean \(\theta_{i}\) of a continuous variable of
interest \(y\) for each area \(i\) is estimated by \[\hat{\theta}_{i}^{\text{Dir}} =
\frac{\sum_{j=1}^{n_i} w_{ij} y_{ij}}{\sum_{j=1}^{n_i} w_{ij}},\]
where the index \(j\) indicates an
individual with \(j = 1, ..., n_i\) in
the \(i\)-th area. The second FH level
links the target indicator \(\theta_i\)
linearly to area-specific covariates \(\boldsymbol{x}_i\), \[\theta_i = \boldsymbol{x}^{\top}_i
\boldsymbol{\beta} +u_{i},\] where \(\boldsymbol{\beta}\) is a vector of unknown
fixed-effect parameters, and \(u_i\) is
an independent and identically normally distributed random effect with
\(u_{i} \, \stackrel{iid}{\sim}
N\left(0,\sigma_{u}^2\right)\).
The combination of the sampling and the linking model leads to a special
linear mixed model \[\hat{\theta}_{i}^{\text{Dir}}=\boldsymbol{x}^{\top}_i
\boldsymbol{\beta} +u_{i}
+ e_{i},
\quad i=1,\ldots,D.
\label{eq:Combination}\] The empirical best linear unbiased
estimators \(\boldsymbol{\hat{\beta}}\)
are computed by weighted least square theory. The empirical best linear
unbiased predictor (EBLUP) of \(\theta_i\) is obtained by substituting the
variance parameter \(\sigma_u^2\) with
an estimate. The resulting estimator can then be written as \[\begin{aligned}
\nonumber \hat{\theta}_i^{\text{FH}}&=
\boldsymbol{x}_i^{\top}\boldsymbol{\hat{\beta}} + \hat{u}_i \\
&=\hat{\gamma}_i\hat{\theta}_i^{\text{Dir}}+\left(1-\hat{\gamma}_i\right)
\boldsymbol{x}_i^{\top}\boldsymbol{\hat{\beta}}.
\label{eq:EBLUP}
\end{aligned}\] The EBLUP/FH estimator can be understood as a
weighted average of the direct estimator \(\hat{\theta}_i^{\text{Dir}}\) and a
regression-synthetic part \(\boldsymbol{x}_i^{\top}\boldsymbol{\hat{\beta}}\).
The estimated shrinkage factor \(\hat{\gamma}_i=\frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2+\sigma_{e_{i}}^2}\)
puts more weight on the direct estimator when the sampling variance is
small and vice versa. Areas for which no direct estimation results
exist, because the sample size is zero or the results may not be
published, are called out-of-sample domains. For those domains, the
prediction reduces to the regression-synthetic component \(\hat{\theta}_{i,\text{out}}^{\text{FH}} =
\boldsymbol{x}_i^{\top}\boldsymbol{\hat{\beta}}\) (Rao and Molina 2015).
Estimation methods for \(\boldsymbol{\sigma_u^2}\)
The variance of the random effects has to be estimated. Commonly used
approaches are the FH method-of-moment estimator (Fay and Herriot 1979), the ML, and the REML
estimators (Rao and Molina 2015). The
likelihood methods are known to perform more efficiently than the
methods of moments (Rao and Molina 2015).
The commonly used methods can produce negative variance estimates that
should be strictly positive. In the estimation methods mentioned above,
negative variance estimates are set to zero (\(\hat{\sigma}_{u}^{2} =
\max\left(\tilde{\sigma}_{u}^{2}, 0\right)\)) resulting in zero
estimates of the shrinkage factor \(\gamma_i\). Therefore, no weight is put on
the direct estimator, ignoring its possible reliability. This poses a
problem, especially when the number of areas is small. To avoid this
so-called over-shrinkage problem, Li and Lahiri
(2010) and Yoshimori and Lahiri
(2014) proposed methods that adjust the respective likelihoods of
the standard ML and REML approaches by some factor: \[L_{\text{adj}}\left(\sigma_u^{2}\right) = A
\times L\left(\sigma_u^{2}\right),\] where \(A\) denotes the adjustment factor and \(L(\sigma_u^{2})\) the given likelihood
function. The proposed adjustment factors are:
by Li and Lahiri (2010): \(A = \sigma_u^{2}\),
by Yoshimori and Lahiri (2014): \(A = \left( \tan^{-1} \left(\sum\limits_{i=1}^{D} \gamma_i \right)\right)^{1/D}\).
Simulation studies conducted by Yoshimori and Lahiri (2014) showed that the adjusted Yoshimori-Lahiri methods are preferable when the variance of the random effect is small relative to the sampling variance. Otherwise, the adjusted Li-Lahiri methods are recommended. Package emdi offers six different variance estimation methods: standard ML ( ml) and REML ( reml), and adjusted ML and REML following either Li and Lahiri (2010) ( amrl, ampl) or Yoshimori and Lahiri (2014) ( amrl_yl, ampl_yl).
In real data applications, problems might occur that were not
theoretically expected. There may also be some violation of the
assumptions of the standard FH model, e.g., normality and independence
of the error terms. The following section outlines the extensions of the
standard FH model that are implemented in the package emdi to address
these issues.
Transformations
When working with right-skewed data like income, wealth or business
data, the assumptions of a linear relation between the response and the
explanatory variables and normality of both error terms (\(u_i\) and \(e_i\)) of the FH model may be violated.
Applying a log-transformation could be a reasonable solution to meet
these model assumptions (Neves, Silva, and Correa
2013; A.-K. Kreutzmann et al. 2022). In the emdi package, the
direct estimates and their variances are transformed following Neves, Silva, and Correa (2013): \[\begin{aligned}
\hat{\theta}_{i}^{\text{Dir*log}} &=
\log\left(\hat{\theta}_{i}^{\text{Dir}}\right), \\
\text{var}\left(\hat{\theta}_{i}^{\text{Dir*log}}\right) &=
\left(\hat{\theta}_{i}^{\text{Dir}}\right)^{-2}
\text{var}\left(\hat{\theta}_{i}^{\text{Dir}}\right),
\end{aligned}\] where the \(\text{*log}\) notation stands for the
logarithmic transformed scale. To obtain the FH estimator on the
transformed scale \(\hat{\theta}_i^{\text{FH*log}}\), \(\hat{\theta}_{i}^{\text{Dir}}\) is
substituted by \(\hat{\theta}_{i}^{\text{Dir*log}}\) and
\(\text{var}(\hat{\theta}_{i}^{\text{Dir*log}})\)
serves as estimate for the sampling variances (\(\sigma_{e_{i}}^2\)) in Equation \[eq:EBLUP\]. Since the logarithm is a
nonlinear transformation, the final FH estimates on the original scale
require a bias-corrected back-transformation (Slud and Maiti 2006; Sugawasa and Kubokawa
2017). The emdi package provides two options:
A crude method ( bc_crude) that takes the properties of the log-normal distribution into account: \[\hat{\theta}_i^{\text{FH, crude}} = \exp \left\{ \hat{\theta}_i^{\text{FH*log}} + 0.5 \text{MSE}\left(\hat{\theta}_i^{\text{FH*log}}\right) \right\}.\]
A bias correction suggested by Slud and Maiti (2006) ( bc_sm) that further regards the bias due to the random effects: \[\hat{\theta}^{\text{FH, Slud-Maiti}}_{i} = \exp\left\{ \hat{\theta}^{\text{FH*log}}_{i} + 0.5 \hat{\sigma}_{u}^{2} \left(1 - \hat{\gamma}_{i}^{\text{*log}}\right)\right\}.\]
The FH estimator on the transformed scale is denoted by \(\hat{\theta}_i^{\text{FH*log}}\) and, accordingly \(\text{MSE}(\hat{\theta}_i^{\text{FH*log}})\) stands for a MSE estimator on the transformed scale, e.g., the Prasad-Rao or Datta-Lahiri MSE (cf. following subsection). The Slud-Maiti back-transformation is derived for the ML variance estimation of the random effect and is implemented for in-sample domains in emdi. In the presence of out-of-sample domains, the crude method can be applied, which allows to use also other variance estimation methods.
Another transformation provided by the emdi package is the arcsin transformation, which is widely used when the direct estimator of the FH model is a ratio (Casas-Cordero, Encina, and Lahiri 2016; Schmid et al. 2017). The emdi package automatically transforms the direct estimates and the sampling variances as suggested by Jiang et al. (2001): \[\begin{aligned} \hat{\theta}_{i}^{\text{Dir*arcsin}} &= \sin^{-1}\left(\sqrt{ \left(\hat{\theta}_{i}^{\text{Dir}}\right)}\right), \\ \text{var}\left(\hat{\theta}_{i}^{\text{Dir*arcsin}}\right) &= 1 / \left(4 \tilde{n}_i\right), \end{aligned}\] where the \(\text{*arcsin}\) denotes the arcsin transformed scale, and \(\tilde{n}_i\) the effective sample size: the sample size adjusted by the sampling design (Jiang et al. 2001). The FH model is estimated using Equation \[eq:EBLUP\] and, if necessary, the results are truncated to the interval \([0, \pi / 2]\) to ensure final results between 0 and 1. To obtain final estimates on the original scale, the final estimation results must be subjected to a back-transformation. Two different back-transformations are available in emdi:
Spatial FH model
The standard FH model assumes independence of the random effects.
However, when working with geographical areas, assuming correlated
random effects to incorporate a certain neighbouring structure can be
valuable. The emdi package contains the spatial FH model introduced by
Petrucci and Salvati (2006) that considers
a simultaneously autoregressive process of order one, SAR(1). Compared
to the standard model, the estimation differs mainly by discarding the
assumptions of independent random effects and estimating a spatial
autoregressive coefficient (\(\rho\))
which takes values between \(-1\) and
\(1\). Greater absolute values of
(\(\rho\)) indicate a stronger
relationship with the neighboring areas. The random effect \(u_i\) in Equation \[eq:Combination\] is replaced by \[\begin{aligned}
\boldsymbol{u} = \rho_1 \boldsymbol{W} \boldsymbol{u} +
\boldsymbol{\epsilon},
\text{ } \boldsymbol{\epsilon} \sim
N\left(\boldsymbol{0}_D,\sigma_{1}^2 \boldsymbol{I}_D\right),
\label{eq:u1}
\end{aligned}\] with \(\boldsymbol{W}\) being the \(D \times D\) row standardized proximity
matrix that describes the neighbourhood structure of the areas, \(\boldsymbol{0}_D\) a vector of zeros and
\(\boldsymbol{I}_D\) the \(D \times D\) identity matrix. The random
effects \(\boldsymbol{u}\) of
Equation \[eq:u1\] follow a SAR(1). When
normality of the random effects is assumed, the model can be fitted by
ML and REML. The application of spatial FH models should be considered
when no geographic auxiliary variables are available to capture the
spatial relation, or when \(\rho_1\) is
larger than 0.5 (Bertarelli et al. 2021).
Even before estimating the model, the emdi package enables testing for
spatial correlation by Moran’s I and Geary’s C statistics (Cliff and Ord 1981; Pratesi and Salvati 2008).
While Moran’s I mimics a typical correlation coefficient whose values
range from \(-1\) and \(1\), Geary’s C takes values between 0 and 2
(0: positive, 1: no, 2: negative spatial autocorrelation). The two
statistics behave inversely to each other.
Robust area-level models
In the case of influential outlying observations, the emdi package
allows for robust versions of the standard and the spatial FH model. The
theory is extensively studied in S. Warnholz
(2016), wherein the robust estimation procedure for linear mixed
models suggested by Sinha and Rao (2009)
was extended to area-level models. The model fitting can be understood
as a robustified ML version that also contains an influence function
with a tuning constant k. 1.345 is recommended as an initial value for
the tuning constant (Sinha and Rao 2009).
When non-symmetric outliers are expected to influence the robust
estimation, a bias correction should be involved. This correction can be
controlled by a multiplier constant ( mult_constant). For further
details, we also refer to R. Chambers et al.
(2014) and Schmid et al.
(2016).
Measurement error model
The standard FH model is based on the assumption that the covariates are
measured without error (Fay and Herriot
1979). This characteristic is typically assumed because census or
register data are used as auxiliary information. However, when the
covariate information stems from larger surveys or alternative data
sources, this assumption can be violated. The emdi package includes an
implementation of the measurement error (ME) model developed by Ybarra and Lohr (2008). To account for the ME in
the covariates \(\boldsymbol{x}_i\),
they modified the shrinkage factor as follows: \[\gamma_i = \frac{\sigma_u^2 +
\boldsymbol{\beta}^{\top} \boldsymbol{C}_i
\boldsymbol{\beta}}{\sigma_u^2 + \boldsymbol{\beta}^{\top}
\boldsymbol{C}_i
\boldsymbol{\beta} + \sigma_{e_i}^2},\] where the \(\boldsymbol{C}_i\) stands for the
variance-covariance matrix of the covariates, which is a required
prerequisite for the model. The modified shrinkage factor pulls more
weight on the direct estimator when the variances of the covariates are
large. A modified weighted least squares method and a moment estimator
were used to estimate \(\boldsymbol{\beta}\)s and the \(\sigma_u^2\), respectively. Additional
details are available in Ybarra and Lohr
(2008).
To evaluate the accuracy of the EBLUP estimates, the MSE is the most common measure used in SAE (Rao and Molina 2015). The emdi package offers a variety of MSE estimators stemming from both analytical determination and resampling strategies, like the bootstrap and jackknife methods. Table \[tab:overviewmse\] gives an overview of the included MSE approaches. For each area-level model presented in the previous sections, the provided MSE types are shown. The quoted references detail extensive formulas and derivations. As an additional measure of variability of the direct and FH estimates, within various functions and methods of the emdi package, the coefficient of variation (CV) is provided: \(CV = \sqrt{\widehat{\text{MSE}}(\hat{\theta}_{i})}/\hat{\theta}_{i}\), where \(\hat{\theta}_{i}\) either stands for \(\hat{\theta}_{i}^{\text{Dir}}\) or \(\hat{\theta}_{i}^{\text{FH}}\).
BXX Model & Type of MSE & Reference
Standard FH (depending on variance estimation of \(\sigma^2_{u}\))
ml/ ampl_yl & Analytical & Datta and
Lahiri (2000)
reml/ amrl_yl & Analytical & Prasad and
Rao (1990)
ampl/ amrl & Analytical & Li and Lahiri
(2010)
ml/ reml (out-of-sample) & Analytical & Rao and Molina (2015)
Transformations
log (depending on back-transformation)
bc_crude & Analytical & Rao and Molina
(2015)
bc_sm & Analytical & Slud and Maiti
(2006)
arcsin (depending on back-transformation)
naive & Jackknife & Jiang et al.
(2001)
& Weighted Jackknife & Jiang et al.
(2001);
& & Chen and Lahiri (2002)
& Parametric bootstrap &
Hadam, Würz, and Kreutzmann (2020) bc
& Parametric bootstrap & Hadam, Würz, and
Kreutzmann (2020)
Spatial FH (depending on variance estimation)
ml/ reml & Analytical & Singh, Shukla,
and Kundu (2005)
ml/ reml &
Parametric bootstrap & Molina, Salvati,
and Pratesi (2009)
reml &
Nonparametric bootstrap & Molina, Salvati,
and Pratesi (2009)
Robust FH
& Pseudolinear & S. Warnholz
(2016)
& Parametric bootstrap & S. Warnholz
(2016)
FH with ME
& Jackknife & Jiang, Lahiri, and Wan
(2002)
The emdi package version 1.1.7 contains a sample ‘eusilcA_smp’ and a population data set ‘eusilcA_pop’ at a household level.
| Variable | Meaning |
|---|---|
| Sample data set | |
| Domain | Austrian districts |
| Mean | Mean of the equivalized household income |
| MTMED | Share of households who earn more than the national median income |
| Cash | Mean employee cash or near cash income |
| Var_Mean | Variance of equivalized household income |
| Var_MTMED | Variance of share of households who earn more than the national median income |
| Var_Cash | Variance of employee cash or near cash income |
| n | Effective sample sizes |
| Population data set | |
| Domain | Austrian districts |
| eqsize | Equivalized household size according to the modified OECD scale |
| cash | Employee cash or near cash income |
| self_empl | Cash benefits or losses from self-employment (net) |
| unempl_ben | Unemployment benefits (net) |
| age_ben | Old-age benefits (net) |
| surv_ben | Survivor’s benefits (net) |
| sick_ben | Sickness benefits (net) |
| dis_ben | Disability benefits (net) |
| rent | Income from rental of a property or land (net) |
| fam_allow | Family/children related allowances (net) |
| house_allow | Housing allowances (net) |
| cap_inv | Interest, dividends, profit from capital investments in unincorporated business (net) |
| tax_adj | Repayments/receipts for tax adjustment (net) |
| ratio_n | Ratios of the population size per area and the total population size |
The generation process for both data sets is extensively described in A.-K. Kreutzmann et al. (2019). Our process is nearly equivalent, but we do not produce out-of-sample domains for the area-level version of the data sets. The Austrian European Union Statistics on Income and Living Conditions (EU-SILC) synthetic 2006 data set ( eusilcP) sourced from the simFrame package (Alfons, Templ, and Filzmoser 2010) serves as basis for our data sets. The lowest regional level in the eusilcP data set consists of the nine Austrian states. Based on certain population size and income criteria, households were allocated to 94 Austrian districts resulting in the synthetic population data set ‘eusilcA_pop’. For the ‘eusilcA_smp’ data set, a sample was drawn following a stratified random sampling process using the districts as strata. To show the usage of the FH model and its extensions, area-level data is required. The area-level survey and population data sets, ‘eusilcA_smpAgg’ and ‘eusilcA_popAgg’, are obtained by aggregation on the district level with the help of the direct function defined by the emdi package. The direct estimates in ‘eusilcA_smpAgg’ are the weighted mean equivalized household income Mean, the ratio of households that earn more than the national median income ( MTMED) and their variances. These are based on the equivalized household income eqIncome in ‘eusilcA_smp’, defined as the total income of a household divided by the size of the household, with household size equalized by the modified equivalence scale of the Organisation for Economic Co-operation and Development (OECD) (Hagenaars, Vos, and Zaidi 1994). Additionally, the mean of the variable cash, its variance and the sample sizes are included in ‘eusilcA_smpAgg’, being required by the model extensions. The population data set ‘eusilcA_popAgg’ contains a variety of variables that describe different income sources of households, and a variable ratio_n that describes the ratios of the population sizes per area and the total population size. The variable Domain exists in both data sets and identifies the different districts. Both data sets have an observation for each of the 94 Austrian districts, with the sample data set ‘eusilcA_smpAgg’ containing eight variables and the population data set ‘eusilcA_popAgg’ containing fifteen. Table 1 provides an overview of all included variables of the sample and population data sets. For the creation of the proximity matrix used in the spatial FH model and the usage of the map_plot function, a shape file is needed. A shape file ‘shape_austria_dis’ ( .rda format, "SpatialPolygonsDataFrame") for the 94 districts of Austria is provided. This file was sourced from the SynerGIS website (Bundesamt für Eich- und Vermessungswesen 2017). The data set ‘eusilcA_prox’, an example proximity matrix, has also been added to the emdi package. The creation of ‘eusilcA_prox’ is described in the following section.
While the theoretical background of the implemented area-level models has been introduced in the section , the focus of this section lies on the functionality and the workflow of their usage in R. All of the contained area-level models can be applied by one function: fh. Table \[tab:inputfh\] gives an overview of the 20 input arguments of function fh, together with a short description and any default settings.
B X n Argument & Description & Default
fixed &
Formula of fixed-effects part of linear mixed model &
vardir &
Domain-specific sampling variances of the direct estimates&
combined_data & Combined sample and census data set &
domains &
Domain indentifier for combined_data & NULL
method & Model fitting method & reml
interval &
Lower and upper limit for the variance estimation & NULL
k & Tuning constant for robust estimation & 1.345
mult_constant &
Bias correction multiplier constant for robust estimation &
1
transformation & Type of transformation & no
backtransformation & Type of back-transformation & NULL
eff_smpsize & Effective sample sizes for the arcsin transformation
& NULL
correlation & Correlation of random effects & no
corMatrix & Proximity matrix for the spatial model & NULL
Ci &
Array of the variance-covariance matrix of the explanatory variables
for each area for the ME model & NULL
tol &
Tolerance value for the variance estimation & 0.0001
maxit &
Maximum number of iterations for the variance estimation &
100
MSE & MSE estimation & FALSE
mse_type & Type of MSE estimator & analytical
B &
Numbers of bootstrap iterations for computation of a bootstrap MSE
and information criteria by Marhuenda, Morales,
and Camen Pardo (2014) & c(50,0)
seed & Seed for random number generator & 123
Not every argument needs a specification for every estimated model. Depending on the area-level model, different arguments have to be determined (see Table \[tab:inputarg\] in Appendix A). Figure 1 demonstrates the estimation possibilities of a standard FH model (for the extended area-level models see Figure 16 in Appendix A).
In line with the direct and ebp functions of package version 1.1.7, the S3 object system is used for function fh (J. Chambers and Hastie 1992). All three return objects of class "emdi". The application of function direct leads to a "direct" object, and of functions ebp and fh to objects of classes "ebp" and "fh", respectively. Though all of the returned objects contain ten components, not every component is available for each estimation method. In these cases they are indicated as NULL (see Table \[tab:components\]). Furthermore, the model component differs for the two classes "ebp" and "fh". The components of objects of class "fh" are provided in Table \[tab:modelcomp\] in Appendix B. Not all of the components are available for every area-level model, e.g., the shrinkage factors per domain are not provided for the spatial and robust model extensions, as they do not have an intuitive interpretation in those cases. Due to the consistent structure, all functions and methods of emdi version 1.1.7 can be applied to objects of class "fh". Additionally, new functions and methods are available for the area-level models. Furthermore, a variety of methods that are available in base R and used by other model fitting R packages are included in the latest package version 2.1.3 for the different "emdi" objects. Two examples of the new generic functions used are coef and logLik. Figure 2 demonstrates the steps of a full data analysis procedure and the respective functions, from model building and diagnostics to presenting the results. The section demonstrates the procedure shown in Figure 2 by applying the standard FH model to the Austrian EU-SILC data described in the section . To demonstrate how the different extended area-level models are fitted with function fh, the section follows.
@z @X @X @g @g @g & Name & Description &
& & & direct & ebp & fh
& ind & Point estimates per area & \(\surd\) & \(\surd\) & \(\surd\)
2 & MSE &
Variance/MSE estimates per area & \(\surd\) & \(\surd\) & \(\surd\)
3 & transform_param &
Transformation and shift parameters & & \(\surd\) &
4 & model & Fitted model & & \(\surd\) & \(\surd\)
5 & framework & List for data description& \(\surd\) & \(\surd\) & \(\surd\)
6 & transformation & Type of transformation & & \(\surd\) & \(\surd\)
7 & method & Estimation method & & \(\surd\) & \(\surd\)
8 & fixed & Formula of fixed effects & & \(\surd\) & \(\surd\)
9 & call & Function call & \(\surd\) & \(\surd\) & \(\surd\)
10 & successful_bootstraps &
Number of successful bootstraps & \(\surd\) & & \(\surd\)
The aim of this example is to estimate the equivalized income for the 94 Austrian districts. The package and the example data sets are loaded as follows:
> library("emdi") > data("eusilcA_popAgg") > data("eusilcA_smpAgg")
Combine input data
The function fh requires one data set (argument combined_data) that
comprises the sample and population data. Thus, the data set must
contain all variables of the formula object fixed, the variances of the
direct estimates and, optionally, a domain identifier. For cases where
the sample and population data are only available separately, a merging
function combine_data is provided. The necessary arguments are the two
data sets and characters specifying the domain indicator for the
respective data sets.
> combined_data <- combine_data( + pop_data = eusilcA_popAgg, pop_domains = "Domain", + smp_data = eusilcA_smpAgg, smp_domains = "Domain")
Identify spatial structures
With the help of a proximity matrix, Moran’s I and Geary’s C test
statistics can be computed to identify spatial structures by the
spatialcor.tests command. For the creation of the proximity matrix, the
shapefile must be loaded. We load the Austrian shapefile that is
provided and merge it to the sample data set by using the respective
domain identifiers with the help of the merge method from the sp package (Pebesma and Bivand 2005). Before merging, we
sort the Austrian shapefile by the domains in the sample data.
> library("sp") > load_shapeaustria() > shape_austria_dis <- shape_austria_dis\[ + order(shape_austria_dis\$PB),\] > austria_shape <- merge(shape_austria_dis, + eusilcA_smpAgg, by.x = "PB", by.y = "Domain", + all.x = F)
Then the poly2nb and nb2mat functions of the spdep package (Bivand and Wong 2018) are used. While poly2nb generates a list of neighbours that share joint boundaries, nb2mat computes a weights matrix. The style argument has to be set to W, as a row standardized proximity matrix is required.
> library("spdep") > rel <- poly2nb(austria_shape, + row.names = austria_shape\(PB) > eusilcA_prox <- nb2mat(rel, style = "W", + zero.policy = TRUE) \end{example} Thus, a row standardized proximity matrix is generated that initially had weights of one if an area shares a boundary with another area and zero if not. Function \texorpdfstring% {{\normalfont\ttfamily\hyphenchar\font=-1 spatialcor.tests}}% {spatialcor.tests} makes use of the \texorpdfstring% {{\normalfont\ttfamily\hyphenchar\font=-1 moran.test}}% {moran.test} and \texorpdfstring% {{\normalfont\ttfamily\hyphenchar\font=-1 geary.test}}% {geary.test} functions with their respective default settings, from the \texorpdfstring% {{\normalfont\fontseries{b}\selectfont spdep}}% {spdep} package. The input arguments are the created matrix and the direct estimates. \begin{example} > spatialcor.tests(direct = combined_data\)Mean, + corMatrix = eusilcA_prox)
Statistics Value p.value 1 Moran’s I 0.2453677 5.607958e-05 2 Geary’s C 0.6238681 2.473294e-03
Since the output indicates only a weak positive spatial autocorrelation, the following estimation procedure does not consider the integration of a correlation structure for the random effects.
Perform model selection
Besides theoretical considerations on which auxiliary variables should
be part of the model, the decision for the best model should be based on
information criteria like the Akaike or Bayesian information criterion
(AIC, BIC). Many applications use selection techniques based on linear
regression (Casas-Cordero, Encina, and Lahiri
2016; Schmid et al. 2017). Instead, the emdi package provides the
AIC, BIC, Kullback information criterion (KIC) and their bootstrap and
bias corrected versions (AICc, AICb1, AICb2, KICc, KICb1, KICb2)
especially developed for FH models by Marhuenda,
Morales, and Camen Pardo (2014). These criteria are also included
in the sae package, but the emdi package enables a stepwise variable
selection procedure based on the chosen information criteria, comparable
to the step function for lm models of package stats. The most important
input arguments are an object of class "fh" and the direction of the
stepwise search ( both, backward, forward). In this example, the default
setting backward and the KICb2 information criterion is used. In the
fixed argument of the fh function, the variables employee cash ( cash),
cash benefits from self-employment ( self_empl) and unemployment
benefits ( unempl_ben) are included. For a valid comparison of models
based on information criteria, the model fitting method must be ml. To
activate the estimation of the information criteria by Marhuenda, Morales, and Camen Pardo (2014), we
set the number of bootstrap iterations to 50. The output shows the
stepwise removal of variables until the lowest KICb2 is reached, the
function call and an overview of the estimated coefficients of the final
recommended model.
> fh_std <- fh(fixed = Mean cash + self_empl + unempl_ben, vardir = "Var_Mean", + combined_data = combined_data, domains = "Domain", method = "ml", B = c(0,50)) > step(fh_std, criteria = "KICb2")
Start: KICb2 = 1709.42 Mean cash + self_empl + unempl_ben
df KICb2 - unempl_ben 1 1708.3 <none> 1709.4 - self_empl 1 1763.0 - cash 1 1808.6
Step: KICb2 = 1708.33 Mean cash + self_empl
df KICb2 <none> 1708.3 - self_empl 1 1765.3 - cash 1 1816.1
Call: fh(fixed = Mean cash + self_empl, vardir = "Var_Mean", combined_data = combined_data, domains = "Domain", method = "ml", B = c(0, 50))
Coefficients: coefficients std.error t.value p.value (Intercept) 3070.51231 635.94290 4.8283 1.377e-06 *** cash 1.05939 0.07049 15.0288 < 2.2e-16 *** self_empl 1.74564 0.22017 7.9284 2.219e-15 *** — Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ’ ’ 1
KICb2 is minimised when the variable unempl_ben is removed.
Therefore, the formula Mean \(\sim\)
cash + self_empl is used in the following.
Estimate EBLUPs and MSEs
The standard FH model is built. In addition to the fixed part, required
arguments are vardir and combined_data. We specify the domains (if the
domains argument is set to NULL, the domains are numbered consecutively)
and activate the estimation of the MSE and of the information criteria
by Marhuenda, Morales, and Camen Pardo
(2014).
> fh_std <- fh(fixed = Mean cash + self_empl, vardir = "Var_Mean", combined_data = + combined_data, domains = "Domain", method = "ml", MSE = TRUE, B = c(0,50))
Assess the estimated model
In many publications using FH models, model diagnostics are discussed
only briefly, if at all. One reason for this might be the lack of an
existing implementation of desired diagnostic measures in R or other
statistical software. The summary method of emdi provides additional
information about the data and model components, in particular the
chosen estimation methods, the number of domains, the log-likelihood,
the information criteria by Marhuenda, Morales,
and Camen Pardo (2014), the adjusted \(R^2\) of a standard linear model and the
adjusted \(R^2\) especially for FH
models proposed by Lahiri and Suntornchost
(2015). Additionally, measures to validate model assumptions
about the standardized realized residuals and the random effects are
provided: skewness and kurtosis ( skewness and kurtosis of package moments, ) of the
standardized realized residuals and the random effects and the test
statistics with corresponding \(p\) value of the Shapiro-Wilks-test for
normality of both error terms. As the introduced area-level models do
not assume a homoscedastic sampling distribution, the realized residuals
(\(\hat{e}_i\)) are standardized by
\(\hat{e}_i^{\text{std}} = \hat{e}_i /
\sigma_{e_{i}}\) for the summary and plot methods. The summary
output differs slightly for the different implemented area-level models.
For example, log-likehoods and thus information criteria are not
available in theory for the robust and the ME model.
> summary(fh_std)
Call: fh(fixed = Mean cash + self_empl, vardir = "Var_Mean", combined_data = combined_data, domains = "Domain", method = "ml", MSE = TRUE, B = c(0, 50))
Out-of-sample domains: 0 In-sample domains: 94
Variance and MSE estimation: Variance estimation method: ml Estimated variance component(s): 1371195 MSE method: datta-lahiri
Coefficients: coefficients std.error t.value p.value (Intercept) 3070.51231 635.94290 4.8283 1.377e-06 *** cash 1.05939 0.07049 15.0288 < 2.2e-16 *** self_empl 1.74564 0.22017 7.9284 2.219e-15 *** — Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ’ ’ 1
Explanatory measures: loglike AIC AICc AICb1 AICb2 BIC KIC 1 -847.8303 1703.661 1703.91 1715.758 1703.461 1713.834 1707.661 KICc KICb1 KICb2 AdjR2 FH_R2 1 1708.783 1720.632 1708.335 0.9212817 0.9482498
Residual diagnostics: Skewness Kurtosis Shapiro_W Shapiro_p Standardized_Residuals 0.3004662 3.971216 0.9840810 0.3119346 Random_effects -0.4113238 3.086048 0.9839858 0.3072834
Transformation: No transformation
The output of the example shows that all domains have survey information and the variance of \(\sigma_u^2\) amounts to 1371195. Further, all of the included auxiliary variables are quite significant and their explanatory power is large with an adjusted \(R^2\) (for FH models) of around 0.95. The results of the Shapiro-Wilk-test indicate that normality is not rejected for both errors. Graphical residual diagnostics are possible by the plot method.
> plot(fh_std)
Figure 6 shows normal quantile-quantile (Q-Q) plots of the standardized realized residuals and random effects (Figure 3) as well as plots of the kernel densities of the distribution of both error terms and, for comparison, a standard normal distribution (Figure 4 and 5). Like in emdi version 1.1.7, the user is free to modify the interface of the plots. The label and color arguments are easy to edit. Additionally, the overall appearance of the plots are changeable by the gg_theme argument, as the plots are built with the ggplot2 package (Wickham 2016). We refer to the package documentation for a detailed description of how to customize the plot arguments. Figure 6 supports the results of the normality tests provided in the summary output, the distribution of the standardized random effects may be slightly skewed (Figure 5). If one is not be satisfied with the results, applying a log-transformation could improve the distribution of the error terms.
Compare results with direct estimates
The FH results should be consistent with the direct estimates for
domains with a small direct MSE and/or large sample sizes. Further, the
precision of the direct estimates should be improved by using auxiliary
information. The comparison of the direct and model-based (FH) estimates
can be done graphically by the generic function compare_plot. For the fh
method the required input argument is an object of class "fh". When the
default settings of the command are used, the output consists of two
plots: a scatter plot proposed by Brown et al.
(2001) and a line plot. Besides the direct and FH estimates, the
plot contains the fitted regression and the identity line. These two
lines should not differ too much. Preferably, the model-based (FH)
estimates should track the direct estimates within the line plot
especially for domains with a large sample size or small MSE of the
direct estimator. The points are ordered by decreasing MSE of the direct
estimates. In addition, the input arguments MSE and CV can be set to
TRUE leading to two extra plots, respectively. The MSE/CV estimates of
the direct and model-based (FH) estimates are compared first via
boxplots and second via ordered scatter plots (ordered by increasing CV
of the direct estimates). Like for the plot command, a variety of
customization options are offered, e.g., the label options ( label), the
format of the points ( shape) and the style of the line (
line_type).
> compare_plot(fh_std, CV = TRUE, label = "no_title")
Except one high value, the fitted regression and identity line of the scatter plot (Figure 7) are relatively close. Note that the high value corresponds to the domain Eisenstadt (Stadt) with a very small sample size of 10 and the highest MSE of the direct estimates, so the direct estimator is very uncertain. Also the direct estimates are well tracked by the model-based (FH) estimates within the line plot (Figure 8). The boxplot (Figure 9) and the ordered scatter plot (Figure 10) show that the precision of the direct estimates could be improved by the usage of the FH model in terms of CVs. Additionally, all of the CV values are less than 20% which is a common rule of the UK Office for National Statistics in order to determine whether estimation results should be published (Miltiadou 2020).
Further on, the function compare enables the user to compute a goodness of fit diagnostic (Brown et al. 2001) and a correlation coefficient of the direct estimates and the estimates of the regression-synthetic part of the FH model (H. Chandra, Salvati, and Chambers 2015). Following Brown et al. (2001), the difference between the model-based estimates and the direct estimates should not be significant (null hypothesis). The Wald test statistic is specified as \[W\left(\hat{\theta}_i^{\text{FH}}\right) = \sum_{i = 1}^{D} \frac{\left(\hat{\theta}_i^{\text{Dir}}- \hat{\theta}_i^{\text{FH}}\right)^2}{\widehat{\text{var}}\left(\hat{\theta}_i^{\text{Dir}}\right) + \widehat{\text{MSE}}\left(\hat{\theta}_i^{\text{FH}}\right)}\] and is approximately \(\chi^2\)-distributed with \(D\) degrees of freedom. When working with out-of-sample domains, those are not taken into account, because the direct estimates and their variances are missing. The input argument of function compare is an "fh" object.
> compare(fh_std)
Brown test
Null hypothesis: EBLUP estimates do not differ significantly from the direct estimates
W.value Df p.value 46.97181 94 0.9999874
Correlation between synthetic part and direct estimator: 0.94
The results of the goodness of fit statistic and the correlation
coefficient confirm what the scatter and the line plot already
indicated. In the example the null hypothesis is not rejected and the
correlation coefficient indicates a strong positive correlation (0.94)
between the direct and model-based (FH) estimates.
Benchmarking for consistent estimates
The idea of benchmarking is that the aggregated FH estimates should sum
up to estimates of a higher regional level (\(\tau\)): \[\sum_{i=1}^{D} \xi_{i}
\hat{\theta}_{i}^{\text{FH, bench}} = \tau,\] where \(\xi_{i}\) stands for the share of the
population size of each area in the total population size (\(N_{i}/N\)). In our example, the EBLUP
estimates could get aggregated on a national level and then compared to
or benchmarked with the Austrian mean equivalized income. The emdi
package contains a benchmark function that allows the user to select
three different options suggested by Datta et al.
(2011). A general estimator of the three options can be written
as follows: \[\hat{\theta}_{i}^{\text{FH,
bench}} = \hat{\theta}_{i}^{\text{FH}} + \left( \sum_{i=1}^{D}
\frac{\xi_{i}^2}{\phi_{i}} \right)^{-1} \left( \tau - \sum_{i=1}^{D}
\xi_{i}\hat{\theta}_{i}^{\text{FH}} \right)
\frac{\xi_{i}}{\phi_{i}}.\] Depending on the weight \(\phi_{i}\), the formula leads to different
benchmarking options. If \(\phi_{i}\)
equals \(\xi_{i}\), all FH estimates
are adjusted by the same value ( raking). A ratio adjustment ( ratio) is
being conducted if \(\phi_{i} =
\xi_{i}/\hat{\theta}_{i}^{\text{FH}}\). For the last option (
MSE_adj), \(\phi_{i} =
\xi_{i}/{\widehat{\text{MSE}}\left(\hat{\theta}_{i}^{\text{FH}}\right)}\).
While the first option is a relatively naive approach, the latter two
conduct larger adjustments for the areas with larger FH and MSE
estimates, respectively. Thus, for the benchmark function the following
arguments have to be specified: an object of class "fh", a benchmark
value, a vector containing the \(\xi_{i}\)s ( share) and the type of
benchmarking. The output is a data frame with an extra column FH_Bench
for the benchmarked EBLUP values. If the optional argument overwrite is
set to TRUE, the benchmarked results are added to the "fh" object and
the MSE estimates of the non benchmarked FH estimates are set to NULL.
For the used example, the benchmark value is calculated by taking the
mean of the variable eqIncome of the ‘eusilcA_smp’ data frame. The \(\xi_{i}\)s can be found in ‘eusilcA_popAgg’ as ratio_n.
> fh_bench <- benchmark(fh_std, benchmark = 20140.09, share = eusilcA_popAgg$ratio_n, + type = “ratio”) > head(fh_bench)
Domain Direct FH FH_Bench Out
1 Amstetten 14768.57 14242.04 14480.61 0 2 Baden 21995.72 21616.40 21978.49 0 3 Bludenz 12069.59 12680.38 12892.79 0 4 Braunau am Inn 10770.53 11925.82 12125.59 0 5 Bregenz 35731.20 32101.69 32639.43 0 6 Bruck-Mürzzuschlag 23027.37 22523.50 22900.79 0 \end{example} It is recognizable that for the first six Austrian districts the original estimates are slightly modified by the benchmarking. \ \ To gain an overview of the point, MSE and CV results of the direct estimates compared to the model-based (FH) results the generic function can be used, but differences among areas or hotspots of special interest are usually easier to detect on maps. With function , the package offers a user-friendly way to produce maps since creating maps can often become a time consuming task. The input arguments mainly consist of an object of class and a spatial polygon of a shape file. The only issue that might come up is if domain identifiers in the data do not match to the respective identifiers of the shape file. In those cases, the input argument is required, which is a data frame that contains the matching of the domain indentifiers of the population and the shape file data sets. For detailed instructions, we refer to and to the help page of function .
For producing maps of the 94 Austrian districts, the Austrian shape file has to be loaded. In addition to the object, the object () and a domain indicator () have to be specified. The argument is not necessary since the identifiers match in our example. To allow for an easier comparison of the results, we adjust the scales of the maps using the argument.Figures~\(\ref{fig:mapa}\) and~\(\ref{fig:mapc}\) show the distribution of
the estimated (direct vs. model-based) equivalized income across
Austria. It is striking that white and light red tones dominate the map,
indicating relatively low mean incomes of the districts. But in
contrast, districts Eisenstadt (Stadt), Urfahr-Umgebung and M"odling
stand out having the largest incomes. Urfahr-Umgebung is also
eye-catching when having a look at the MSE estimates (Figures~\(\ref{fig:mapb}\) and~\(\ref{fig:mapd}\)). The MSE of the direct
and the FH estimates are quite high. Probably a single wealthy household
raised the mean income and also the variance. Figure~\(\ref{fig:mapb}\) contains some districts
with MSEs larger than the customized scaling (gray areas). Without the
scaling it would have been hard to identify any differences in
Figure~\(\ref{fig:mapd}\).
\ \ Some users might have an interest to store the results separately or
to use them for presentations. Excel and OpenDocument Spreadsheets
provide many opportunities for that. In contrast to some existing R
packages, the functions / do not only export the estimation results, but
also the output of . Usage of the functions is comprehensively described
in .
\ If other data sources than register data, e.g., data from larger surveys or big data sources are used as auxiliary information, the ME model should be applied. For the estimation of the ME model, the model fitting method must be set to and the only possible MSE estimation method is . The most complex input argument consists of the creation of the MSE array . The variability of the auxiliary variables that is taken into account by the ME model is expressed by the variance-covariance matrices per domain (). For example, for three covariates a, b and c the array should look like % \[\begin{equation*} \boldsymbol{C}_i = \left( \begin{array}{rrrr} 0 & 0 & 0 & 0 \\ 0 & \text{var}_i(a) & \text{cov}_i(a,b) & \text{cov}_i(a,c) \\ 0 & \text{cov}_i(a,b) & \text{var}_i(b) & \text{cov}_i(b,c) \\ 0 & \text{cov}_i(a,c) & \text{cov}_i(b,c) & \text{var}_i(c) \\ \end{array}\right), \text{ } i = 1,...,D. \end{equation*}\] % The first row and column contain zeros, because the intercept is considered. The variances and covariances can be computed by standard approaches like, for example, the Horvitz-Thompson estimator.
For the Austrian EUSILC data example, the equalized income can also be explained by a variable of the sample data set. The code below demonstrates how the MSE array is created for one covariate (variable and its variance ) and how the final ME model is built. \begin{example} > P <- 1 > M <- 94 > Ci_array <- array(data = 0, dim = c(P + 1, P + 1, M)) > Ci_array[2,2, ] <- eusilcA_smpAgg$Var_Cash
> fh_yl <- fh(fixed = Mean Cash, vardir = "Var_Mean", + combined_data = eusilcA_smpAgg, domains = "Domain", method = "me", + Ci = Ci_array, MSE = TRUE, mse_type = "jackknife")
In this paper, we have presented how the emdi package version 1.1.7 has been extended with various area-level models. Along with the well-known FH model, adjusted variance estimation methods and transformation options are offered to the user. In addition, spatial, robust, and ME model extensions of the standard model allow the user to address various issues that arise in practical data applications. All of these methods can be estimated conveniently by using a single function that provides EBLUP and the respective MSE estimates to measure their precision. Especially in the section , it is clear that the package does not only contain tools for estimation of the different SAE models. Instead, it additionally provides user-friendly tools to enable a whole data analysis procedure: 1. starting with model building and estimation, moving on to 2. model assessment and diagnostics, 3. presentation of the results, and finishing with 4. exporting the results to Excel or OpenDocument Spreadsheet.
For future package versions, it is planned to expand the options in the field of area-level models. In some practical applications, the incorporation of random effects is redundant. Therefore, an area-level estimator that considers a preliminary testing for the random effects following Molina, Rao, and Datta (2015) will be included. Since version 2.0.0 emdi accounts for spatial structures of the random effects. Future developments may also account for out-of-sample EBLUP and MSE estimation for the spatial model proposed by Saei and Chambers (2005) and for temporal and spatio-temporal extensions (Rao and Yu 1994; Marhuenda, Molina, and Morales 2013). For the existing ME model, a bootstrap MSE estimation option may be added to the package since the Jackknife MSE estimator may produce negative MSE estimates (Marchetti et al. 2015). Furthermore, cross-validation options additional to the model assessment via information criteria and the \(R^2\) will be investigated.
The work of Kreutzmann and Schmid has been supported by the German Research Foundation within the project QUESSAMI (281573942) and by the MIUR-DAAD Joint Mobility Program (57265468). The numerical results are not official estimates and are only produced for illustrating the methods. The authors are indebted to the Editor-in-Chief, Associate Editor and the referees for comments that significantly improved the article.
max width=, scale=.99
max width=, scale=.99
width=0.92
@*1l @C @C @C @C @C Argument &
& Standard & Transformed & Spatial & Robust &
ME
fixed & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
vardir & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
combined_data & \(\surd\) &
\(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
domains & (\(\surd\)) & (\(\surd\)) & (\(\surd\)) & (\(\surd\)) & (\(\surd\))
method & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
interval & (\(\surd\)) & (\(\surd\)) & & &
k & & & & \(\surd\)
&
mult_constant & & & & \(\surd\) &
transformation & \(\surd\) &
\(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
backtransformation & & \(\surd\) & & &
eff_smpsize (only if & & \(\surd\) & & &
transformation = "arcsin") & & & & &
correlation & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
corMatrix (only if & & & \(\surd\) & \(\surd\) &
correlation = "spatial") & & & & &
Ci & & & & & \(\surd\)
tol & & & \(\surd\) &
\(\surd\) & \(\surd\)
maxit & & & \(\surd\) &
\(\surd\) & \(\surd\)
MSE & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
mse_type (only if MSE = TRUE) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
B & (\(\surd\)) & \(\surd\) & \(\surd\) & \(\surd\) &
seed & (\(\surd\)) & (\(\surd\)) & (\(\surd\)) & (\(\surd\)) &
width=0.92
@X @X @g @f @g @g @s Name & Short description & Available
for
& & Standard & Transformed & Spatial & Robust &
ME
coefficients &
Estimated regression coefficients & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
variance &
Estimated variance of the random effects/ estimated spatial
correlation parameter & \(\surd\)
& \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
random_effects &
Random effects per domain & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
real_residuals &
Realized residuals per domain & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
std_real_residuals &
Standardized realized residuals per domain & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
gamma &
Shrinkage factors per domain & \(\surd\) & \(\surd\) & & & \(\surd\)
model_select &
Model selection and accuracy criteria & \(\surd\) & \(\surd\) & \(\surd\) & &
correlation &
Selected correlation structure of the random effects & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\)
k &
Tuning constant & & & & \(\surd\) &
mult_constant &
Multiplier constant for bias correction & & & & \(\surd\) &
seed &
Seed of the random number generator & \(\surd\) & \(\surd\) & \(\surd\) & \(\surd\) &
For the computation of the results in this paper we worked with R version 4.2.2 on a 64-bit platform under Microsoft Windows 10 with the installed packages listed in Table 2. Using the package packrat (Ushey et al. 2022) a snapshot of the corresponding repository was created that is available from the GitHub folder (https://github.com/SoerenPannier/emdi.git). We suggest the following steps:
Install Git.
Create a new project in RStudio.
Choose checkout from version control and select Git.
Insert the repository URL: https://github.com/SoerenPannier/emdi.git.
Let packrat complete the initialization process.
Restart RStudio.
Enter the R command packrat::restore().
After finishing the installation process all packages are installed as provided in Table 2.
| Package | Version | Package | Version | Package | Version |
|---|---|---|---|---|---|
| aoos | 0.5.0 | highr | 0.9 | RColorBrewer | 1.1-3 |
| assertthat | 0.2.1 | HLMdiag | 0.5.0 | Rcpp | 1.0.9 |
| backports | 1.4.1 | hms | 1.1.1 | RcppArmadillo | 0.11.2.0.0 |
| BBmisc | 1.12 | isoband | 0.2.5 | readODS | 1.7.0 |
| bit | 4.0.4 | janitor | 2.1.0 | readr | 2.1.2 |
| bit64 | 4.0.5 | jsonlite | 1.8.0 | rematch | 1.0.1 |
| boot | 1.3-28 | knitr | 1.39 | rematch2 | 2.1.2 |
| brew | 1.0-7 | labeling | 0.4.2 | reshape2 | 1.4.4 |
| brio | 1.1.3 | laeken | 0.5.2 | rgeos | 0.5-9 |
| cachem | 1.0.6 | lifecycle | 1.0.1 | rlang | 1.0.4 |
| callr | 3.7.1 | lubridate | 1.8.0 | roxygen2 | 7.2.1 |
| cellranger | 1.1.0 | magrittr | 2.0.3 | rprojroot | 2.0.3 |
| checkmate | 2.1.0 | maptools | 1.1-4 | s2 | 1.1.0 |
| classInt | 0.4-7 | MASS | 7.3-58 | saeRobust | 0.3.0 |
| cli | 3.3.0 | memoise | 2.0.1 | scales | 1.2.0 |
| clipr | 0.8.0 | modules | 0.10.0 | sf | 1.0-8 |
| colorspace | 2.0-3 | moments | 0.14.1 | simFrame | 0.5.4 |
| commonmark | 1.8.0 | MuMIn | 1.47.1 | snakecase | 0.11.0 |
| cpp11 | 0.4.2 | munsell | 0.5.0 | sp | 1.5-0 |
| crayon | 1.5.1 | nlme | 3.1-158 | spData | 2.0.1 |
| data.table | 1.14.2 | openxlsx | 4.2.5 | spdep | 1.2-4 |
| DBI | 1.1.3 | operator.tools | 1.6.3 | stringi | 1.7.8 |
| deldir | 1.0-6 | packrat | 0.8.1 | stringr | 1.4.0 |
| desc | 1.4.1 | parallelMap | 1.5.1 | terra | 1.5-34 |
| diagonals | 6.4.0 | pbapply | 1.5-0 | testthat | 3.1.4 |
| diffobj | 0.3.5 | pillar | 1.8.0 | tibble | 3.1.8 |
| digest | 0.6.29 | pkgconfig | 2.0.3 | tidyr | 1.2.0 |
| dplyr | 1.0.9 | pkgload | 1.3.0 | tidyselect | 1.1.2 |
| e1071 | 1.7-11 | plyr | 1.8.7 | tzdb | 0.3.0 |
| ellipsis | 0.3.2 | praise | 1.0.0 | units | 0.8-0 |
| emdi | 2.1.3 | prettyunits | 1.1.1 | utf8 | 1.2.2 |
| evaluate | 0.15 | processx | 3.7.0 | vctrs | 0.4.1 |
| fansi | 1.0.3 | progress | 1.2.2 | viridisLite | 0.4.0 |
| farver | 2.1.1 | proxy | 0.4-27 | vroom | 1.5.7 |
| fastmap | 1.1.0 | ps | 1.7.1 | waldo | 0.4.0 |
| formula.tools | 1.7.1 | purrr | 0.3.4 | withr | 2.5.0 |
| fs | 1.5.2 | R.cache | 0.16.0 | wk | 0.6.0 |
| generics | 0.1.3 | R.methodsS3 | 1.8.2 | xfun | 0.31 |
| ggplot2 | 3.3.6 | R.oo | 1.25.0 | xml2 | 1.3.3 |
| ggrepel | 0.9.1 | R.rsp | 0.45.0 | yaml | 2.3.5 |
| glue | 1.6.2 | R.utils | 2.12.0 | zip | 2.2.0 |
| gridExtra | 2.3 | R6 | 2.5.1 | ||
| gtable | 0.3.0 | raster | 3.5-21 |