Multiple Imputation and Missing Data
Source:vignettes/multiple_imputation.Rmd
multiple_imputation.Rmd
The marginaleffects
package offers convenience functions
to compute and display predictions, contrasts, and marginal effects from
models with multiple imputation from the mice
package. The
workflow follows Rubin’s rules (Rubin, 1987, p. 76), via the following
steps:
- Impute \(M\) data sets.
- Fit in each of the \(M\) imputed data sets.
- Compute marginal effects in each of the \(M\) data sets.
- Pool results.
To highlight the workflow, we consider a simple linear regression
model, although the same workflow should work with any model type which
is fit using a formula interface and a data
argument.
marginaleffects
support the mice
imputation
package, and any other package which can return a list of imputed data
frames.
mice
First, we insert missing observations randomly in a dataset and we
impute the dataset using the mice
package:
library(mice)
library(marginaleffects)
set.seed(1024)
dat <- iris
dat$Sepal.Length[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Sepal.Width[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Species[sample(seq_len(nrow(iris)), 40)] <- NA
dat_mice <- mice(dat, m = 20, printFlag = FALSE, .Random.seed = 1024)
Then, we use the standard mice
syntax to produce an
object of class mira
with all the models:
Finally, we feed the mira
object to a
marginaleffects
function:
mfx_mice <- avg_slopes(mod_mice, by = "Species")
mfx_mice
#>
#> Term Contrast Species Estimate Std. Error Df t Pr(>|t|) 2.5 % 97.5 %
#> Sepal.Length mean(dY/dX) setosa 0.0684 0.0560 120.0 1.222 0.22413 -0.0424 0.179
#> Sepal.Length mean(dY/dX) versicolor 0.0540 0.0557 91.1 0.969 0.33507 -0.0567 0.165
#> Sepal.Length mean(dY/dX) virginica 0.0582 0.0513 104.6 1.136 0.25869 -0.0434 0.160
#> Sepal.Width mean(dY/dX) setosa 0.1890 0.0836 400.4 2.260 0.02434 0.0246 0.353
#> Sepal.Width mean(dY/dX) versicolor 0.2093 0.0792 89.2 2.643 0.00971 0.0519 0.367
#> Sepal.Width mean(dY/dX) virginica 0.2241 0.1026 61.2 2.185 0.03272 0.0190 0.429
#> Species mean(versicolor) - mean(setosa) setosa 1.1399 0.0977 114.8 11.668 < 0.001 0.9464 1.333
#> Species mean(versicolor) - mean(setosa) versicolor 1.1399 0.0977 114.8 11.668 < 0.001 0.9464 1.333
#> Species mean(versicolor) - mean(setosa) virginica 1.1399 0.0977 114.8 11.668 < 0.001 0.9464 1.333
#> Species mean(virginica) - mean(setosa) setosa 1.7408 0.1108 121.6 15.709 < 0.001 1.5214 1.960
#> Species mean(virginica) - mean(setosa) versicolor 1.7408 0.1108 121.6 15.709 < 0.001 1.5214 1.960
#> Species mean(virginica) - mean(setosa) virginica 1.7408 0.1108 121.6 15.709 < 0.001 1.5214 1.960
#>
#> Columns: term, contrast, Species, estimate, std.error, predicted, predicted_hi, predicted_lo, df, statistic, p.value, conf.low, conf.high
Other imputation packages: Amelia
,
missRanger
, or lists of imputed data frames.
Several R
packages can impute missing data. Indeed, the
Missing Data CRAN View
lists at least a dozen
alternatives. Since user interface changes a lot from package to
package, marginaleffects
supports a single workflow which
can be used, with some adaptation, to with all imputation packages:
- Use an external package to create a list of imputed data frames.
- Apply the
datalist2mids()
function from themiceadds
package to convert the list of imputed data frames to amids
object. - Use the
with()
function to fit models, as illustrated in themice
section above. - Pass the
mids
object to amarginaleffects
function.
Consider two imputation packages, which can both generate lists of
imputed datasets: Amelia
and missRanger
.
library(Amelia)
library(miceadds)
library(missRanger)
# impute data
dat_amelia <- amelia(dat, noms = "Species", p2s = 0)$imputations
mids_amelia <- datlist2mids(dat_amelia)
# convert lists of imputed datasets to `mids` objects
dat_missRanger <- replicate(20, missRanger(dat, verbose = 0), simplify = FALSE)
mids_missRanger <- datlist2mids(dat_missRanger)
# fit models
mod_amelia <- with(mids_amelia, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))
mod_missRanger <- with(mids_missRanger, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))
# `Amelia` slopes
mfx_amelia <- avg_slopes(mod_amelia, by = "Species")
mfx_amelia
#>
#> Term Contrast Species Estimate Std. Error Df t Pr(>|t|) 2.5 % 97.5 %
#> Sepal.Length mean(dY/dX) setosa 0.3456 0.0935 7.92 3.696 0.00618 0.1296 0.562
#> Sepal.Length mean(dY/dX) virginica 0.3058 0.1020 6.52 2.998 0.02177 0.0609 0.551
#> Sepal.Length mean(dY/dX) versicolor 0.3008 0.0972 6.75 3.095 0.01829 0.0692 0.532
#> Sepal.Width mean(dY/dX) setosa -0.1690 0.1693 7.30 -0.998 0.35009 -0.5659 0.228
#> Sepal.Width mean(dY/dX) virginica -0.0621 0.1692 6.65 -0.367 0.72495 -0.4665 0.342
#> Sepal.Width mean(dY/dX) versicolor -0.0596 0.1510 7.77 -0.395 0.70348 -0.4097 0.290
#> Species mean(versicolor) - mean(setosa) setosa 0.6696 0.2063 5.88 3.246 0.01805 0.1624 1.177
#> Species mean(versicolor) - mean(setosa) virginica 0.6696 0.2063 5.88 3.246 0.01805 0.1624 1.177
#> Species mean(versicolor) - mean(setosa) versicolor 0.6696 0.2063 5.88 3.246 0.01805 0.1624 1.177
#> Species mean(virginica) - mean(setosa) setosa 1.1303 0.2169 6.22 5.211 0.00178 0.6041 1.656
#> Species mean(virginica) - mean(setosa) virginica 1.1303 0.2169 6.22 5.211 0.00178 0.6041 1.656
#> Species mean(virginica) - mean(setosa) versicolor 1.1303 0.2169 6.22 5.211 0.00178 0.6041 1.656
#>
#> Columns: term, contrast, Species, estimate, std.error, predicted, predicted_hi, predicted_lo, df, statistic, p.value, conf.low, conf.high
# `missRanger` slopes
mfx_missRanger <- avg_slopes(mod_missRanger, by = "Species")
mfx_missRanger
#>
#> Term Contrast Species Estimate Std. Error Df t Pr(>|t|) 2.5 % 97.5 %
#> Sepal.Length mean(dY/dX) setosa 0.0583 0.0414 3386591 1.41 0.15927 -0.02287 0.139
#> Sepal.Length mean(dY/dX) versicolor 0.0670 0.0392 772141 1.71 0.08745 -0.00984 0.144
#> Sepal.Length mean(dY/dX) virginica 0.0639 0.0368 1300639 1.74 0.08211 -0.00814 0.136
#> Sepal.Width mean(dY/dX) setosa 0.2314 0.0691 3034569 3.35 < 0.001 0.09600 0.367
#> Sepal.Width mean(dY/dX) versicolor 0.2188 0.0551 1270189 3.97 < 0.001 0.11091 0.327
#> Sepal.Width mean(dY/dX) virginica 0.2092 0.0687 625864 3.05 0.00232 0.07457 0.344
#> Species mean(versicolor) - mean(setosa) setosa 1.1588 0.0704 4239295 16.45 < 0.001 1.02078 1.297
#> Species mean(versicolor) - mean(setosa) versicolor 1.1588 0.0704 4239295 16.45 < 0.001 1.02078 1.297
#> Species mean(versicolor) - mean(setosa) virginica 1.1588 0.0704 4239295 16.45 < 0.001 1.02078 1.297
#> Species mean(virginica) - mean(setosa) setosa 1.7784 0.0823 4276524 21.61 < 0.001 1.61710 1.940
#> Species mean(virginica) - mean(setosa) versicolor 1.7784 0.0823 4276524 21.61 < 0.001 1.61710 1.940
#> Species mean(virginica) - mean(setosa) virginica 1.7784 0.0823 4276524 21.61 < 0.001 1.61710 1.940
#>
#> Columns: term, contrast, Species, estimate, std.error, predicted, predicted_hi, predicted_lo, df, statistic, p.value, conf.low, conf.high
Comparing results with different imputation software
We can use the modelsummary
package to compare the
results with listwise deletion to the results using different
imputations software:
library(modelsummary)
# listwise deletion slopes
mod_lwd <- lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species, data = dat)
mfx_lwd <- avg_slopes(mod_lwd, by = "Species")
# regression table
models <- list(
"LWD" = mfx_lwd,
"mice" = mfx_mice,
"Amelia" = mfx_amelia,
"missRanger" = mfx_missRanger)
modelsummary(models, shape = term : contrast + Species ~ model)
Species | LWD | mice | Amelia | missRanger | |
---|---|---|---|---|---|
Sepal.Length mean(dY/dX) | setosa | 0.033 | 0.068 | 0.346 | 0.058 |
setosa | (0.061) | (0.056) | (0.094) | (0.041) | |
versicolor | 0.050 | 0.054 | 0.301 | 0.067 | |
versicolor | (0.061) | (0.056) | (0.097) | (0.039) | |
virginica | 0.043 | 0.058 | 0.306 | 0.064 | |
virginica | (0.058) | (0.051) | (0.102) | (0.037) | |
Sepal.Width mean(dY/dX) | setosa | 0.274 | 0.189 | −0.169 | 0.231 |
setosa | (0.091) | (0.084) | (0.169) | (0.069) | |
versicolor | 0.255 | 0.209 | −0.060 | 0.219 | |
versicolor | (0.074) | (0.079) | (0.151) | (0.055) | |
virginica | 0.234 | 0.224 | −0.062 | 0.209 | |
virginica | (0.083) | (0.103) | (0.169) | (0.069) | |
Species mean(versicolor) - mean(setosa) | setosa | 1.157 | 1.140 | 0.670 | 1.159 |
setosa | (0.097) | (0.098) | (0.206) | (0.070) | |
versicolor | 1.157 | 1.140 | 0.670 | 1.159 | |
versicolor | (0.097) | (0.098) | (0.206) | (0.070) | |
virginica | 1.157 | 1.140 | 0.670 | 1.159 | |
virginica | (0.097) | (0.098) | (0.206) | (0.070) | |
Species mean(virginica) - mean(setosa) | setosa | 1.839 | 1.741 | 1.130 | 1.778 |
setosa | (0.123) | (0.111) | (0.217) | (0.082) | |
versicolor | 1.839 | 1.741 | 1.130 | 1.778 | |
versicolor | (0.123) | (0.111) | (0.217) | (0.082) | |
virginica | 1.839 | 1.741 | 1.130 | 1.778 | |
virginica | (0.123) | (0.111) | (0.217) | (0.082) | |
Num.Obs. | 60 | 150 | 150 | 150 | |
Num.Imp. | 20 | 5 | 20 | ||
R2 | 0.953 | 0.930 | 0.873 | 0.947 | |
R2 Adj. | 0.949 | 0.928 | 0.869 | 0.945 | |
AIC | −34.0 | ||||
BIC | −19.3 | ||||
Log.Lik. | 23.997 | ||||
F | 220.780 | ||||
RMSE | 0.16 |
Passing new data arguments: newdata
, wts
,
by
, etc.
Sometimes we want to pass arguments changing or specifying the data
on which we will do our analysis using marginaleffects
.
This can be for reasons such as wanting to specify the values or weights
at which we evaluate e.g. avg_slopes
, or due to the
underlying models not robustly preserving all the original data columns
(such as fixest
objects not saving their data in the fit
object making it potentially challenging to retrieve, and even if
retrievable it will not include the weights used during fitting as a
column as wts
expects when given a string).
If we are not using multiple imputation, or if we want to just pass a
single dataset to the several fitted models after multiple imputation,
we can pass a single dataset to the newdata
argument.
However, if we wish to supply each model in our list resulting after
multiple imputation with a /different/ dataset on which to calculate
results, we cannot use newdata
. Instead, in this case it
can be useful to revert to a more manual (but still very easy) approach.
Here is an example calculating avg_slopes
using a different
set of weights for each of the fixest
models which we fit
after multiple imputation.
set.seed(1024)
library(mice)
library(fixest)
library(marginaleffects)
dat <- mtcars
# insert missing values
dat$hp[sample(seq_len(nrow(mtcars)), 10)] <- NA
dat$mpg[sample(seq_len(nrow(mtcars)), 10)] <- NA
dat$gear[sample(seq_len(nrow(mtcars)), 10)] <- NA
# multiple imputation
dat <- mice(dat, m = 5, method = "sample", printFlag = FALSE)
dat <- complete(dat, action = "all")
# fit models
mod <- lapply(dat, \(x)
feglm(am ~ mpg * cyl + hp,
weight = ~gear,
family = binomial,
data = x))
# slopes without weights
lapply(seq_along(mod), \(i)
avg_slopes(mod[[i]], newdata = dat[[i]])) |>
mice::pool()
#> Class: mipo m = 5
#> term m estimate ubar b t dfcom df riv lambda fmi
#> 1 mpg 5 0.006083006 1.080596e-04 2.722367e-04 4.347437e-04 29 3.458473 3.023184 0.7514407 0.8284122
#> 2 cyl 5 -0.134279397 7.095810e-04 2.347149e-03 3.526159e-03 29 2.921517 3.969355 0.7987666 0.8667335
#> 3 hp 5 0.001649797 5.709181e-07 1.375524e-06 2.221547e-06 29 3.556952 2.891184 0.7430088 0.8213962
# slopes with weights
lapply(seq_along(mod), \(i)
avg_slopes(mod[[i]], newdata = dat[[i]], wts = "gear")) |>
mice::pool()
#> Class: mipo m = 5
#> term m estimate ubar b t dfcom df riv lambda fmi
#> 1 mpg 5 0.006251264 1.056051e-04 2.705362e-04 4.302485e-04 29 3.422444 3.074127 0.7545486 0.8309841
#> 2 cyl 5 -0.135839559 7.279608e-04 2.480980e-03 3.705137e-03 29 2.868429 4.089748 0.8035266 0.8704861
#> 3 hp 5 0.001671181 5.697634e-07 1.424665e-06 2.279361e-06 29 3.474829 3.000540 0.7500337 0.8272454