Abstract
Different inference procedures are proposed in the literature to correct selection bias that might be introduced with non-random sampling mechanisms. The R package NonProbEst enables the estimation of parameters using some of these techniques to correct selection bias in non-probability surveys. The mean and the total of the target variable are estimated using Propensity Score Adjustment, calibration, statistical matching, model-based, model-assisted and model-calibratated techniques. Confidence intervals can also obtained for each method. Machine learning algorithms can be used for estimating the propensities or for predicting the unknown values of the target variable for the non-sampled units. Variance of a given estimator is performed by two different Leave-One-Out jackknife procedures. The functionality of the package is illustrated with example data sets.Since sampling theory was formalized in the beginning of the 20th century, surveys have been the main tool to obtain information from society and nature. Traditional surveys used telephone or face-to-face interviews for questionnaire administration, as well as mailing lists. However, the increase of costs, linked to the decrease in response rates, and the development of information and communication technologies have favored the use of new survey modes such as online or smartphone questionnaires. These modes make the sampling process cheaper and faster, but tend to amplify bias from several sources. More precisely, online surveys are often performed through a non-probability sampling, using self-selection procedures without a defined sampling frame where the inclusion probabilities are known or with deficient sampling frames with coverage issues, leading to higher levels of selection bias (Elliott and Valliant 2017).
Some techniques can be used to correct selection bias in online non-probability surveys. A good overview of the various methods is given in (Elliott and Valliant 2017). There are three important approaches: the pseudo-design based inference (or pseudo-randomisation (Buelens, Burger, and Brakel 2018)), statistical matching and predictive inference.
In the pseudo-design based inference, the idea is to construct weights to correct for selection bias. The first method is estimating response probabilities and using them in Horvitz-Thompson or Hajek type estimators to account for unequal selection probabilities. The most used method to estimate response probabilities is Propensity Score Adjustment (see e.g. (Lee and Valliant 2009)). This method uses a probability reference sample in addition to a non-probability convenience sample to construct a response propensity model. Sample matching is another approach also applied to tackle selection bias. A predictive model, with the target variable as the dependent variable, is built using data from the non-probability sample. This model is subsequently applied to a probability sample (where the target variable is not measured) to predict values of its individuals for an estimation of the population values. Similarly, predictive methods are based on superpopulation models. In this approach, a predictive model is fitted for the analysis variable from the sample and used to project the sample to the full population. This approach (that can be used with probability and non-probability samples) allows researchers to use the auxiliary information about covariates in different methods for predicting the unknown values. Most of these methods require special software for their implementation. The package NonProbEst implements some of these techniques.
The paper is structured as follows. First, we introduce the notation used throughout the paper and we discuss the different ways to do inference for non-probability surveys. In section 3 we briefly comment on the usefulness of Machine Learning (ML) Techniques in this context. Then, we describe the R package NonProbEst. In section 5 we briefly describe the use of the functions, including suitable examples, for each method.
Let \(U\) denote a finite population with \(N\) units, \(U = \left\{1,\ldots,k,\ldots,N\right\}\). Let \(s_V\) be a volunteer non-probability sample of size \(n_{V}\), self-selected from an online population \(U_V\) which is a subset of the total target population \(U\). Let \(y\) be the variable of interest in the survey estimation. Without any auxiliary information, the population total of \(y\), \(Y\), is usually estimated with the following Horvitz-Thompson type estimator: \[\label{eq1} \hat{Y}_{HT}=\sum_{k \in s_V} w_{vk} y_k (\#eq:eq1)\] being \(w_{vk}\) a weight of the unit \(k\) set by the researcher to adjust the lack of response, lack of coverage, voluntariness, ... (e.g. by means of post-stratification). A simple choice is \(w_{vk} = N/n_V\), that is, consider the sample of volunteers as if it was obtained with a simple random sampling design of the population \(U\).
This estimator has a bias induced by various mechanisms regarding their application. The most important are the selection bias (due to the difference between sampled and nonsampled individuals on the probability to participate in a survey) and the coverage bias (the online population \(U_v\) is not the same of the target population \(U\)).
The key to successful weighting to remove the bias in non-probability surveys lies in the use of powerful auxiliary information. Auxiliary information can be available in different forms. We distinguish three different cases, called InfoTP, InfoES and InfoEP, depending on the information at hand.
We will now explain the main methods used to treat these biases depending on the type of information that is available.
Let \(\mathbf{x}_k\) be the value taken on unit \(k\) by a vector of auxiliary variables which population total is assumed to be known \(\mathbf {X} = \sum_{k = 1}^N \mathbf{x}_k\). The calibration estimation of \(Y\) consists in the computation of a new vector of weights \(w_k\) for \(k \in s\) which modifies as little as possible the original sample weights, \(w_{vk}\), which have the desirable property of producing unbiased estimations, respecting at the same time the calibration equations
\[\label{eq2}
\sum_{k \in s_V} w_k \mathbf{x}_k =
\mathbf{X}. (\#eq:eq2)\]
Given a pseudo-distance \(G(w_k,w_{vk})\), the calibration process
consists in finding the solution to the minimization problem
\[\label{eq3}
\min_{w_k} {\color{blue}\{\sum_{k\in
s_V}G(w_k,w_{vk})\}} (\#eq:eq3)\]
while respecting the calibration equation (@ref(eq:eq2)). Several
distances were defined in (Deville and Särndal
1992), being the linear distance one of the most commonly used.
The resulting estimator of \(Y\) under
the chi-square distance is the general regression estimator \[\label{eq8}
Y_{reg} = \sum_{s_V} w_k y_k = \sum_{s_V}d_k y_k + (\mathbf{X} -
\sum_{s_V}w_{vk} \mathbf{x}_k )'
\hat{B}_{s_V} (\#eq:eq8)\]
where \(\hat{B}_s\) is \[\label{eq9}
\hat{B}_{s_V} = T_s^{-1} \sum_{s_V} w_{vk} \mathbf{x}_k
y_k (\#eq:eq9)\] being \(T_s=
\sum_{s_V} w_{vk} \mathbf{x}_k \mathbf{x}_k'\).
It is proved in (Bethlehem 2010) that bias can be reduced through calibration only when the non-response due to volunteering has a Missing At Random scheme, while it cannot be equally done in Not Missing at Random situations (which are the most frequent).
The Propensity Score Adjustment method was originally developed by (Rosenbaum and Rubin 1983) which sought to reduce the confounding bias between treatment and control groups in experimental designs. This approach would be considered in sampling research as well in combination with a reference sample (Rubin 1986), but it was not proposed for online surveys until the early 2000’s (Taylor et al. 2001).
It is expected that a sample collected by online recruitment would not follow the principles of a probability sampling, especially in those cases that the survey is filled by volunteer respondents. In such a situation, every individual is associated to a probability of participating in the survey which depends on her or his characteristics.
The propensity for an individual to take part on the non-probability survey is obtained by training a predictive model (often a logistic regression) on the dichotomous variable, \(I_{s_V}\), which measures whether a respondent from the combination of both samples took part in the volunteer survey or in the reference survey. Covariates used in the model, \(\mathbf{x}\), are measured in both samples (in contrast to the target variable which is only measured in the non-probability sample), thus the formula to compute the propensity of taking part in the volunteer survey with a logistic model, \(\pi\), can be displayed as \[\label{logreg} \pi (\mathbf{x}) = \frac{1}{e^{-(\gamma^T \mathbf{x})} + 1} (\#eq:logreg)\] for some vector \(\gamma\), as a function of the model covariates.
We denote by \(s_R\) the reference sample and \(w_{Rk}\) the original design weight of the \(k\) individual in the reference sample
Several options for using the propensity scores in estimation are listed below:
We can use the inverse of the estimated response propensity as a weight for constructing the estimator (Valliant 2019): \[\label{thajek1} \hat{Y}_{PSA1}=\sum_{k \in {s_V}} {w_V}_k y_k /\hat{\pi} (\mathbf{x}_k) = \sum_{k \in {s_V}} y_k w_k^{PSA1} (\#eq:thajek1)\] where \(\hat{\pi} (\mathbf{x}_k)\) is the estimated response propensity for the individual \(k\) of the volunteer sample as predicted using covariates \(\mathbf{x}\).
Alternatively, the approach proposed in (Schonlau and Couper 2017) can be used to obtain weights for a Horvitz-Thompson type estimator using propensity scores. Weights are defined as
\[\label{pesohajek2} w_k^{PSA2} = \frac{1 - \hat{\pi} (\mathbf{x}_k) }{\hat{\pi} (\mathbf{x}_k)} (\#eq:pesohajek2)\] and resulting estimator for the population total is given by \[\label{ypsapost1} \hat{Y}_{PSA2}= \sum_{k \in {s_V}} y_k w_k^{PSA2} (\#eq:ypsapost1)\]
(Valliant and Dever 2011) use the propensity scores to post-stratify the sample. The process is: sort the combined sample by \(\hat{\pi} (\mathbf{x}_k)\); split the combined sample into g classes ( \(g\) = 5 as the conventional choice following (Cochran 1968)), each of which has about the same number of cases in the combined sample; and compute an average propensity, \(\bar{\pi}_g\) within subclass g. Use \(\bar{\pi}_g\) as the weight adjustment for every person in the subclass. Resulting estimator is:
\[\label{thajek3} \hat{Y}_{PSA3}= \sum_g \sum_{k \in {s_V}_g} {w_V}_k y_k /\bar{\pi}_g = \sum_g \sum_{k \in {s_V}_g} y_k w_k^{PSA3} (\#eq:thajek3)\]
Following the approach described in (Lee and Valliant 2009) propensity scores are divided in \(g\) classes, where all units may have the same propensity score or at least be in a very narrow range and an adjustment factor is calculated as:
\[\label{adjfac} f_g = \frac{\sum_{k \in {s_R}_g} w_{Rk} / \sum_{k \in s_R} w_{Rk} }{\sum_{k \in {s_V}_g} w_{Vk} / \sum_{k \in s_V} w_{Vk}} (\#eq:adjfac)\] where \({s_R}_g\) is the set of individuals in the reference sample that are in the \(g\)th class of propensity scores and \({s_V}_g\) is the set of individuals in the volunteer sample that are in the \(g\)th class of propensity scores. Finally, the adjusted weights \(w^{PSA4}\) are the product of the original weights and the adjustment factor; following the same notation, the adjusted weight for individual \(k\) in \({s_V}_g\) (i. e. the individual \(k\) of the \(g\)th propensity class in the volunteer sample) is computed as \[\label{adjwtd} w_k^{PSA4} =w_{Vk} f_g (\#eq:adjwtd)\] and the estimator is given by
\[\label{ypsapost2} \hat{Y}_{PSA4}= \sum_g \sum_{k \in {s_V}_g} y_k w_k^{PSA4} (\#eq:ypsapost2)\]
Research findings have shown that PSA successfully removes bias in some situations, but at the cost of increasing the variance (Lee and Valliant 2009). (Valliant and Dever 2011) showed that the estimation of a variable using PSA must be complemented with further weighting adjustment in order to make estimates less biased. The use of PSA with further calibration is studied in (Lee and Valliant 2009) and (Ferri-Garcı́a and Rueda 2018), concluding that calibration adjustments are helpful if they are applied using the right covariates.
Variance estimation in PSA is not a simple issue. (Valliant 2019) proposes an estimator of the variance for an estimator of a mean, \(\hat{\overline{y}}\), based on linearization, but this estimator does not take into account the randomness of weight estimation, therefore it will tend to underestimate the variance.
Jackknife’s variance estimator ((Quenouille 1956)) can be seen as an acceptable alternative in nonprobability samples after applying PSA. Let \(\hat{\overline{y}}= \frac{1}{N} \sum_{k \in {s_V}} {w}_k^{PSA} y_k\) be the estimator of the mean of \(y\), his Leave-One-Out Jackknife estimator of the variance is given by: \[\label{jackknife2} \hat{V}(\hat{\overline{y}}) = \frac{n-1}{n} \sum_{j = 1}^n (\overline{y}_{(j)} - \overline{y})^2 (\#eq:jackknife2)\] where \(\overline{y}_{(j)}\) is the value of the estimator \(\hat{\overline{y}}\) after dropping unit \(j\) from \(s_V\) and where \(\overline{y}\) is the mean of values \(\overline{y}_{(j)}\).
Given that PSA weights are estimated from the available data, the exclusion of one unit can have an impact on the values of \(w_i\) and affect the variability of the estimator. This variability can be taken into account if propensities are recalculated for each of the \(n\) Leave-One-Out partitions. Thus a Jackknife estimator with recalculating weights is defined as: \[\label{jackknife3} \hat{V}_{rw}(\hat{\overline{y}}) = \frac{n-1}{n} \sum_{j = 1}^n ( \overline{y}_{rw(j)} - \overline{y}_{rw})^2 (\#eq:jackknife3)\] where \(\overline{y}_{rw(j)} = \frac{1}{N} \displaystyle \sum_{k \in {s_V-\{j\}}} {w}_k^{PSA}(j) y_k\), with \({w}_k^{PSA}(j)\) the PSA weight obtained from the sample \({s_V-\{j\}}\) and \(\overline{y}_{rw}\) is the mean of values \(\overline{y}_{rw(j)}\).
The statistical matching method was introduced by Rivers (2007). The idea is to model the relationship between \(y_k\) and \(\mathbf{x}_k\) using the volunteer sample \(s_V\) in order to predict \(y_k\) for the reference sample. That is, the matching estimator is given by:
\[\hat{Y}_{SM} = \sum_{s_R} \hat{y}_k w_{Rk}\] being \(\hat{y}_k\) the predict value of \(y_k\).
The key is how to predict the values \(y_k\). Usually \(\hat{y}_k= \mathbf{x}_k' \hat{\beta}\) being \(\hat{\beta}= \sum_{k \in {s_V}}y_k\mathbf{x}_k / \sum_{k \in {s_V}} \mathbf{x}_k' \mathbf{x}_k\) but other methods can be considered as donor imputation (Rivers 2007) or fractional donor imputation (Kim and Fuller 2004).
A major drawback of matching is that the precision of the non-probability sample reduces to the standard error of the reference sample (Buelens, Burger, and Brakel 2018). These authors also justify that matching is based on strong ignorability assumptions and can lead biased estimators if the assumptions are not met.
The prediction approach is based on superpopulation models, which assume that the population under study \({\bf y}=(y_1,...,y_N)'\) is a realization of super-population random variables \({\bf Y}= (Y_1,...,Y_N)'\) having a superpopulation model \(\xi\). To incorporate auxiliary information \(\mathbf{x}_k\) available for all \(k\in U\) on assume a superpopulation for \(y\) built on some mean function of \(\mathbf{x}\): \[Y_k = m(\mathbf{x}_k) +e_k, \quad k=1,...,N.\]
The random vector \(e=(e_1,...,e_N)'\) is assumed to have zero mean and a positive definite covariance matrix which is diagonal (\(Y_k\) are mutually independent).
Using a set of covariates, \(\mathbf{x}\), measured in \(s_V\) and \(\overline{s}_V = U -s_V\) it is possible to estimate the values of \(y\) in \(\overline{s}_V\) with regression modeling such that the estimated value of \(y\) for an individual \(k\) can be calculated through the following expression: \[\label{eq11} \hat{y}_k = E_m (y_k | \mathbf{x}_k) (\#eq:eq11)\] \(m\) alludes to the specific model which provides the expectation of \(y_k\), and \(\mathbf{x}_k\) are the values of the \(k\)-th individual in the covariates \(\mathbf{x}\).
We can use the auxiliary information in several ways to define several estimators:
the model-based estimator:
\[\hat{Y}_{m} = \sum_{k \in s_V} y_k + \sum_{k \in \overline{s_V}} \hat{y}_k\]
the model-assisted estimator: \[\hat{Y}_{ma} = \sum_{k \in U} \hat{y}_k + \sum_{k \in s_V} ({y}_k- \hat{y}_k)w_{Vk}\]
the model-calibrated estimator: \[\label{CAL} \hat{Y}_{mcal} = \sum_{k \in s_V} y_{k} w_k^{CAL} (\#eq:CAL)\] where \(w_k^{CAL}\) are such that they minimize \(\sum_{k \in s} G\left(w_k^{CAL}, w_{Vk}\right)\), where \(G(\cdot,\cdot)\) is a particular distance function, subject to \[\sum_{k \in s_V} w_k^{CAL} \hat{y}_k = \sum_{k \in U} \hat{y}_k.\]
Usually the linear regression model is used, \(E_m (y_k | \mathbf{x}_k) = \mathbf{x}_k' \beta\) and the above estimators can be rewritten as a type of regression estimators.
Prediction estimators need complete information about the auxiliary variables (InfoEP) and can fail if the model is not true, but might potentially be fruitful to correct for selection bias in informative sampling (Buelens, Burger, and Brakel 2018).
The emerging data sources like Big Data can be used in combination to traditional survey samples for construct more valid inferences. Machine Learning (ML) methods can be used for the matter, given their known advantages in high dimensional environments. There are several types of learning algorithms but for this package we focus on classification and regression. Classification aims to identify the category to which a new observation belongs while regression is used for prediction in real-valuated variables. Both are trained with known observations to make predictions based on some covariates.
There is a vast spectrum of classification and regression algorithms to take into account, starting from the basic linear and logistic regressions and its extensions, like Ridge regression (Hoerl and Kennard 1970). Other examples are decision trees which uses tree-like graphs , like the C4.5 (Quinlan 1993). More modern approaches even build ensembles of decision trees with outstanding results, like XGBoost (T. Chen and Guestrin 2016). During the last few years, deep learning models have been dramatically improving the state-of-the-art (LeCun, Bengio, and Hinton 2015). However, many other techniques are still being widely used and developed, like some bayesian methods (Park and Casella 2008). Having so many different options, choosing the right learning algorithm for each problem is key for obtaining optimal results.
Regarding survey research, the use of ML algorithms has been studied in the last few years for deriving model-assisted estimators ((Montanari and Ranalli 2007); (Baffetta et al. 2009); (Breidt, Opsomer, et al. 2017)). In the prediction approach ML algorithms uses the sample to train a model capturing the behaviour of a target variable which is to be estimated, and applies it to the nonsampled individuals to obtain population-level estimates. Applications of machine learning algorithms in PSA for nonresponse propensity have been studied for classification and regression trees (Phipps, Toth, et al. 2012) and Random Forests (Buskirk and Kolenikov 2015); their efficacy on reducing nonresponse bias in comparison to logistic regression depends on the available covariates and the complexity of the relationships. (J. K. T. Chen, Valliant, and Elliott 2019) use LASSO for calibrating non-probability surveys. (Buelens, Burger, and Brakel 2018) review existing inference methods to correct for selection bias and recommend adding ML methods to deal with non-probability samples.
NonProbEst allows the use of a wide variety of classification and regression algorithms for model-based, model-assisted and model-calibrated estimators, matching and PSA (which only works with classification). It offers so many alternatives by relying on caret (Kuhn 2018), a well known machine learning package.
The package NonProbEst
implements in R a set of techniques for estimation in non-probability
surveys, using various approaches which correspond to several
frameworks. Functions in the package allow to obtain calibration weights
via calib_weights, propensity scores via
propensities and matching predictions for a reference
sample via matching. Propensity scores can be transformed
into weights by all of the approaches mentioned in previous sections via
functions lee_weights, sc_weights,
valliant_weights, vd_weights. These weights
can be used for estimation of total, mean and proportion of a given
target variable measured in a sample using functions
total_estimation, mean_estimation,
prop_estimation. Alternatively, total and mean can also be
calculated using a model-based, a model-assisted or a model-calibrated
approach with the functions model_based,
model_assisted and model_calibrated
respectively. The variance of the estimators can be calculated using the
Leave-One-Out Jackknife method, this is, recalculating the set of
weights after substracting one unit or not, by means of the functions
generic_jackknife_variance and
jackknife_variance, and without recalculating the weights
via fast_jackknife_variance. Frequentist confidence
intervals of the estimates can be directly computed with the
confidence_interval function.
Calibration weights are obtained using the calib
function of the sampling
package (Tillé and Matei 2016) for
g-weights computation. calib_weights offers a wrapper for
calculation of final weights straight from the dataset. Functions that
require prediction techniques, such as propensities,
matching, model_based,
model_assisted, model_calibrated and
jackknife_variance, use the train function
from the caret
package (Kuhn 2018). This function allows
the user to use any of the algorithms in the large list of functions
which are covered by train, with the possibility of
optimizing hyperparameters for a better performance of the predictors.
For propensity estimation, only classification algorithms should be used
as the target variable is binary (participation in the probability
sample vs participation in the non-probability sample). Case weights are
used to balance both classes (for models that accept them). For
matching, model-based and model-assisted estimations, algorithms should
account for the type of variable of the target feature.
Note that weighting formulas for PSA from Lee
(2006) and Valliant and Dever
(2011) require applying a stratification procedure. In both
lee_weights and vd_weights the same procedure
is applied: the vector of propensities is sorted increasingly, and the
individuals are equally divided in \(g\) strata of the same length according to
their position in the sorted vector. \(g\) is defined by the user, and the
procedure results in a vector with the strata number (from 1 to \(g\)) to which a given individual
corresponds. This stratification avoids errors that could arise from the
lack of unique values.
Three datasets are available in the package: sampleP,
sampleNP and population. These fictitious
datasets were created as described in Ferri-Garcı́a and Rueda (2018);
sampleP represents a probability sample of size \(n_r = 500\) extracted by simple random
sampling from a frame covering the entire population, while
sampleNP represents a non-probability sample of size \(n_v = 1000\) extracted by simple random
sampling from a frame covering only the subpopulation of individuals who
have access to Internet. The dataset of the complete population of size
\(N = 50000\) is available in
population. Variables available in each dataset differ,
with sampleNP having the largest amount of variables. In
the aforementioned dataset, three variables (vote_gen,
vote_pens, vote_pir) measuring whether an
individual would vote to a given party ("gen", "pens" or "pir") in an
election or not. Probabilities of voting to party "gen", "pens" or "pir"
are higher if the individual is a woman, and elder person and has access
to the Internet, respectively. These variables are only measured in
sampleNP, meaning that adjustment methods have to be
applied in order to produce reliable estimates of voting intentions. For
the matter, the rest of the available variables in the dataset, which
are also included in sampleP (except for the language) and
population, can be used. education_primaria,
education_secundaria, education_terciaria are
three disjunct variables measuring the education level of the individual
(Primary, Secondary or Tertiary Education), while age and
sex measures the numeric age and the gender (0 female, 1
male). Finally, language measures whether the individual’s
native language is the official language or not. The absence of certain
variables in the datasets accounts for real situations where not all the
information is available at individual level.
It must be mentioned that the use of jackknife_variance
for calculating the variance of the estimators via Leave-One-Out
Jackknife will be computationally slower than the
fast_jackknife_variance alternative. Recalculating the
weights in each iteration means that the weighting procedure has to be
repeated as many times as individuals are in the non-probability sample.
If Propensity Score Adjustment is used for weighting, the models have to
be rebuilt in each iteration, resulting in larger computation times
which will depend on the computational costs of the algorithms used for
propensity estimation. Note that generic_jackknife_variance
will behave similarly if the estimator passed as argument involves
predictive modelling algorithms or other costly procedures. To show the
difference of procedures, we calculated the Leave-One-Out Jackknife
estimated variance of the estimator of the mean for the variable
vote_pir in a non-probability sample of size \(n_v = 100\) extracted by simple random
sampling on the sampleNP dataset, using a probability
sample of size \(n_r = 100\) extracted
by simple random sampling on the sampleP dataset as the
reference sample data. Considering a population of \(N = 50000\), variance estimates of the
estimator weighted by PSA using different algorithms were computed,
measuring the computation elapsed time. All the calculations were
performed in a Intel(R) Core(TM) i7-3770 CPU up to 3.40GHz. Results can
be consulted in Table 1
| Weight recalculation | PSA algorithm | R function | Elapsed time (seconds) |
|---|---|---|---|
| No | Logistic regression | glm | 0.004999876 |
| Yes | Logistic regression | glm | 75.56034 |
| Yes | CART | rpart | 102.3409 |
| Yes | Random Forest | rf | 203.7737 |
| Yes | GBM | gbm | 453.731 |
| Yes | Neural Network | nnet | 719.733 |
In this example, the variance estimation with recalculations takes more than 15000 times the seconds that it takes without recalculations if logistic regression is the method used for propensity estimation, and almost 144000 times if feed-forward neural networks are used. Time differences might be different depending on the data, the estimator and the algorithm, but they will be largely appreciable in all cases.
In order to ilustrate how the resources in the package can be used for estimation in non-probability surveys, some examples of each adjustment covered by the package are developed in the following section.
Suppose that a non-probability sample of 1000 individuals recruited
via online surveying is available for estimating the vote intention in a
given election. For the matter, sampleNP will be used as
the non-probability sample data.
> library(NonProbEst)
> head(sampleNP)
vote_gen vote_pens vote_pir education_primaria education_secundaria education_terciaria age sex language
1 0 1 0 1 0 0 66 1 1
2 0 0 1 0 0 1 30 1 1
3 1 0 0 0 1 0 62 0 1
4 0 0 1 1 0 0 33 0 1
5 0 0 1 0 1 0 30 0 1
6 0 0 0 1 0 0 69 1 1
Some auxiliary information is available in the sample; more
precisely, individual data on education, age, gender and language (as
described in the previous Section) can be used for mitigating the
effects of coverage error. Population totals are available for all of
these auxiliar variables, as they have been measured for the entire
population. They can be retrieved from the population
dataset:
> head(population)
education_primaria education_secundaria education_terciaria age sex language
1 0 1 0 39 1 1
2 0 0 1 55 0 1
3 1 0 0 35 0 1
4 1 0 0 58 1 1
5 1 0 0 36 1 1
6 0 1 0 61 1 1
> totals <- colSums(population)
> totals
education_primaria education_secundaria education_terciaria age sex language
25287 10546 14167 2539340 24430 45429
If the variables of which population totals are available are not
disjunct, Raking calibration can be applied in order to estimate cell
counts and account for the lack of information. This can be done with
the calib_weights function; in this case, the
Xs argument were the dataset sampleNP
selecting the auxiliar variables only. Other arguments involve the
totals previously obtained and the initial weights, which allows the
user to specify whether sampling design weights were used or not. In the
latter case, unitary weights should be provided as a vector of ones of
length equal to the number of individuals in the non-probability sample.
Population size and method to be used by the calib function
from sampling
have to be specified.
> covariates <- colnames(sampleNP)[4:9]
> initial_weights <- rep(1, nrow(sampleNP))
> w <- calib_weights(sampleNP[, covariates], totals, initial_weights,
N = 50000, method = "raking")
Once we obtain the weights, estimates for the mean (proportion if the
variable is binary) or the total of any variable present in the
non-probability sample can be obtained using
mean_estimation or total_estimation
respectively. For example, the estimated proportion of votes for each
party can be obtained with the following code:
> mean_estimation(sampleNP, w, "vote_gen", N = 50000)
vote_gen
0.09824163
> mean_estimation(sampleNP, w, "vote_pens", N = 50000)
vote_pens
0.3726149
> mean_estimation(sampleNP, w, "vote_pir", N = 50000)
vote_pir
0.3905399
If these estimates are compared to those which would be obtained if no adjustment was used, the effect of calibration is notorious. As the presence of "gen" voters in the sample is MCAR, estimates do not differ, but in the case of "pens" voters whose presence is MAR, the calibration approach gives a larger estimate which can be explained by the fact that the overrepresentation of younger people in the sample has been corrected up to a point. To a much lesser extent, this correction is also noticeable in the estimation of vote to "pir" (presence of their voters in the sample is NMAR).
> sum(sampleNP$vote_gen)/nrow(sampleNP)
[1] 0.096
> sum(sampleNP$vote_pens)/nrow(sampleNP)
[1] 0.346
> sum(sampleNP$vote_pir)/nrow(sampleNP)
[1] 0.404
> sum(sampleNP$vote_gen)/nrow(sampleNP) -
+ mean_estimation(sampleNP, w, "vote_gen", N = 50000)
vote_gen
-0.00224163
> sum(sampleNP$vote_pens)/nrow(sampleNP) -
+ mean_estimation(sampleNP, w, "vote_pens", N = 50000)
vote_pens
-0.02661494
> sum(sampleNP$vote_pir)/nrow(sampleNP) -
+ mean_estimation(sampleNP, w, "vote_pir", N = 50000)
vote_pir
0.01346014
The variance of the estimates can be assessed through Leave-One-Out Jackknife, both with or without reweighting in each iteration. In the former case, a function must be created by the user for such a task. In the following lines, a function example is developed for estimating the variance on the estimation of the proportion of votes for the "pir" party:
### Leave-One-Out Jackknife variance estimation with reweighting
> estimator <- function(s){
initial_weights <- rep(1, nrow(s))
w <- calib_weights(s[,covariates], totals, initial_weights, N = 50000,
method = "raking")
return(mean_estimation(s, w, "vote_pir", N = 50000))
}
> v_r <- generic_jackknife_variance(sampleNP, estimator, N = 50000)
> v_r
[1] 0.0003352199
### Leave-One-Out Jackknife variance estimation without reweighting
> v_nr <- fast_jackknife_variance(sampleNP, w, estimated_vars = "vote_pir", N = 50000)
> v_nr
vote_pir
0.0003189449
These estimates of the variance can be used for the construction of
confidence intervals for the estimation of the proportion via
confidence_interval function. This function requires the
point estimator and the standard deviation as arguments, with the option
to fix the confidence level. If not specified by the user, the
confidence interval is calculated at 95\(\%\) confidence level.
> ic_r <- confidence_interval( mean_estimation(sampleNP, w, "vote_pir", N = 50000),
sqrt(v_r)
)
> ic_r
lower.vote_pir upper.vote_pir
0.3546549 0.4264249
> ic_nr <- confidence_interval( mean_estimation(sampleNP, w, "vote_pir", N = 50000),
sqrt(v_nr)
)
> ic_nr
lower.vote_pir upper.vote_pir
0.3555368 0.4255429
Suppose that, in addition to the non-probability sample, a
probability sample of the same target population is available as
auxiliary information. The target variable is not measured, but some
other variables which are also available in the non-probability sample
have been measured on it. For the matter, sampleP will be
used as data from the probability sample.
> head(sampleP)
education_primaria education_secundaria education_terciaria age sex
1 1 0 0 35 1
2 0 0 1 64 0
3 1 0 0 55 1
4 0 1 0 61 1
5 0 0 1 35 0
6 1 0 0 51 1
In order to reduce the selection bias, Propensity Score Adjustment
can be used in this case for reweighting. This procedure is implemented
in the propensities function; it requires both samples, the
list of covariates to be used to build the models for propensity
estimation, and three arguments regarding technical aspects of the
adjustment: the prediction algorithm (must match any of the list of
caret supported algorithms), a boolean indicating whether
smoothing of propensities is applied or not, and a vector of strings
specifying the preprocessing procedures to be passed to
train (by default, preprocessing is not applied). Further
arguments to be passed to train can be specified.
In this example, the propensity of participating will be estimated
using k-Nearest Neighbors with further smoothing and a parameter grid of
all the odd numbers between 3 and 11 for optimization of \(k\). The covariates will be all the
variables measured in sampleP. The result will be a list
with two vectors: the estimated propensities for individuals in the
non-probability (convenience) and the probability (reference) sample
respectively.
> covariates <- colnames(sampleP)
> pi <- propensities(sampleNP, sampleP, covariates,
algorithm = "knn", smooth = T, tuneGrid = data.frame(k = seq(3, 11, by = 2)))
> summary(pi$convenience)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3079 0.6249 0.6873 0.6834 0.7584 0.9995
> summary(pi$reference)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3079 0.5384 0.6388 0.6236 0.6998 0.9469
The propensities must be subsequently transformed into weights for
their application in survey estimation. Transformations available in NonProbEst
include approaches developed by Lee (2006)
and Lee and Valliant (2009) in the
lee_weights function, Valliant and
Dever (2011) in the vd_weights function, Schonlau and Couper (2017) in the
sc_weights function and Valliant
(2019) in the valliant_weights function.
lee_weights and vd_weights require
propensities of both samples and a number of strata (5 by default),
while sc_weights and valliant_weights only
require propensities of the non-probability sample.
For example, if we want to apply propensities via weights developed in Valliant and Dever (2011) for the estimation of voting intention to party "pir", we can do it with the following code:
> wi <- vd_weights(convenience_propensities = pi$convenience,
reference_propensities = pi$reference)
> summary(wi)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.233 1.376 1.493 1.505 1.632 2.011
> mean_estimation(sample = sampleNP, weights = wi,
estimated_vars = "vote_pir")
vote_pir
0.4006072
#Estimation of the 95% confidence interval
> estim <- mean_estimation(sample = sampleNP, weights = wi,
estimated_vars = "vote_pir")
> std_dev <- fast_jackknife_variance(sample = sampleNP, weights = wi,
estimated_vars = "vote_pir", N = 50000)
> confidence_interval(estimation = estim, std_dev = std_dev, confidence = 0.95)
lower.vote_pir upper.vote_pir
0.4001341 0.4010803
Note that for those weights that are calculated by means of propensity stratification, propensities of the individuals in the convenience and reference sample are needed. If they are calculated by inverting propensities, only those for the individuals in the convenience sample are needed. For example, if we calculate weights via the formula developed in Schonlau and Couper (2017), the code is:
> wi <- sc_weights(propensities = pi$convenience)
> summary(wi)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0004998 0.3185741 0.4549419 0.5044062 0.6003197 2.2479720
Apart from direct estimation, resulting weights can be used as inputs
in the initial_weights argument of the
calib_weights function for the estimation with PSA and
calibration, or with the package survey
(Lumley 2018) for more complex
analysis.
In this case, in addition to the non-probability sample, the
population itself is avaliable for some covariates. However, the target
variable is only measured in the non-probability sample. For the matter,
sampleNP will be used as the non-probability sample data
and population will be used as the population data.
The model-based estimator can be used to estimate the population
total (or mean) for the target variable. In this example, the expected
number of votes for "pens" will be estimated with regularized logistic regression as learning
algorithm. This procedure is implemented in the model_based
function. It requires the sample, the population, the covariates names
and the target variable as arguments. In our example, the specific
algorithm and a normalization preprocessing are passed to change default
behaviour. Since no optimization strategy is specified in this case, a
default bootstrap will be applied.
> covariates <- c("education_primaria", "education_secundaria",
"education_terciaria", "age", "sex", "language")
> mySample = sampleNP
> mySample$vote_pens = factor(mySample$vote_pens, c(0, 1), c('F', 'T'))
> model_based(mySample, population, covariates, "vote_pens",
positive_label = 'T', algorithm = "glmnet", proc = c("center", "scale"))
[1] 18282.51
If the proportion of votes has to be estimated, rather than the
total, it would be as simple as adding the estimate_mean
argument as follows:
> model_based(mySample, population, covariates, "vote_pens", positive_label = 'T',
algorithm = "glmnet", proc = c("center", "scale"), estimate_mean = TRUE)
[1] 0.366757
Alternatively, model-calibrated estimator can be used to achieve higher efficiency in some situations. In that case, design weights have to be specified in the argument "weights", in addition to the rest of arguments previously described. If no sampling design was followed in data collection, which is the case that we suppose in our example, we can specify unitary weights by turning the parameter to 1, as it is done in the following code:
> model_calibrated(sample_data = mySample, weights = 1, full_data = population,
+ covariates = covariates, estimated_var = "vote_pens", positive_label = 'T',
+ algorithm = "glmnet", proc = c("center","scale"),
+ estimate_mean = TRUE)
[1] 0.365945
In this paper we show how the NonProbEst package can simplify the application of different weighting methods to correct selection bias in non-probability surveys. This package is, to the best of our knowledge, the first package that supports the user beyond estimation in PSA, PSA+calibration, statistical matching or model-calibration. Another important feature is that a wide range of ML techniques can be used to optimize the information provided by the auxiliary variables.
Additional features will be integrated in future versions of the package. Some simplified wrappers will be developed for some methods so non-expert users can also easily apply them, more parameters will be avaliable for estimation and further support for weighted models will be added. Also, other techniques for variance estimation can be considered. Many of these features can already be applied combining NonProbEst with the survey package, as noted before.
Regarding Machine Learning, methods for variable selection will be studied as well as the use of more advanced deep learning libraries outside of caret’s scope. Variable selection would help explaining the bias and choosing the best covariates for its correction. Better deep learning libraries would allow the use of state-of-the-art algorithms.
This work is partially supported by Ministerio de Economía y Competitividad of Spain (grant MTM2015-63609-R) and by Ministerio de Ciencia, Innovación y Universidades (grant FPU17/02177).