Skip to contents

The marginaleffects package offers convenience functions to compute and display predictions, contrasts, and marginal effects from models with multiple imputation from the mice package. The workflow follows Rubin’s rules (Rubin, 1987, p. 76), via the following steps:

  1. Impute \(M\) data sets.
  2. Fit in each of the \(M\) imputed data sets.
  3. Compute marginal effects in each of the \(M\) data sets.
  4. Pool results.

To highlight the workflow, we consider a simple linear regression model, although the same workflow should work with any model type which is fit using a formula interface and a data argument.

marginaleffects support the mice imputation package, and any other package which can return a list of imputed data frames.

mice

First, we insert missing observations randomly in a dataset and we impute the dataset using the mice package:

library(mice)
library(marginaleffects)
set.seed(1024)

dat <- iris
dat$Sepal.Length[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Sepal.Width[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Species[sample(seq_len(nrow(iris)), 40)] <- NA

dat_mice <- mice(dat, m = 20, printFlag = FALSE, .Random.seed = 1024)

Then, we use the standard mice syntax to produce an object of class mira with all the models:

mod_mice <- with(dat_mice, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

Finally, we feed the mira object to a marginaleffects function:

mfx_mice <- avg_slopes(mod_mice, by = "Species")
mfx_mice
#> 
#>          Term                        Contrast    Species Estimate Std. Error    Df      t Pr(>|t|)   2.5 % 97.5 %
#>  Sepal.Length mean(dY/dX)                     setosa       0.0684     0.0560 120.0  1.222  0.22413 -0.0424  0.179
#>  Sepal.Length mean(dY/dX)                     versicolor   0.0540     0.0557  91.1  0.969  0.33507 -0.0567  0.165
#>  Sepal.Length mean(dY/dX)                     virginica    0.0582     0.0513 104.6  1.136  0.25869 -0.0434  0.160
#>  Sepal.Width  mean(dY/dX)                     setosa       0.1890     0.0836 400.4  2.260  0.02434  0.0246  0.353
#>  Sepal.Width  mean(dY/dX)                     versicolor   0.2093     0.0792  89.2  2.643  0.00971  0.0519  0.367
#>  Sepal.Width  mean(dY/dX)                     virginica    0.2241     0.1026  61.2  2.185  0.03272  0.0190  0.429
#>  Species      mean(versicolor) - mean(setosa) setosa       1.1399     0.0977 114.8 11.668  < 0.001  0.9464  1.333
#>  Species      mean(versicolor) - mean(setosa) versicolor   1.1399     0.0977 114.8 11.668  < 0.001  0.9464  1.333
#>  Species      mean(versicolor) - mean(setosa) virginica    1.1399     0.0977 114.8 11.668  < 0.001  0.9464  1.333
#>  Species      mean(virginica) - mean(setosa)  setosa       1.7408     0.1108 121.6 15.709  < 0.001  1.5214  1.960
#>  Species      mean(virginica) - mean(setosa)  versicolor   1.7408     0.1108 121.6 15.709  < 0.001  1.5214  1.960
#>  Species      mean(virginica) - mean(setosa)  virginica    1.7408     0.1108 121.6 15.709  < 0.001  1.5214  1.960
#> 
#> Columns: term, contrast, Species, estimate, std.error, predicted, predicted_hi, predicted_lo, df, statistic, p.value, conf.low, conf.high

Other imputation packages: Amelia, missRanger, or lists of imputed data frames.

Several R packages can impute missing data. Indeed, the Missing Data CRAN View lists at least a dozen alternatives. Since user interface changes a lot from package to package, marginaleffects supports a single workflow which can be used, with some adaptation, to with all imputation packages:

  1. Use an external package to create a list of imputed data frames.
  2. Apply the datalist2mids() function from the miceadds package to convert the list of imputed data frames to a mids object.
  3. Use the with() function to fit models, as illustrated in the mice section above.
  4. Pass the mids object to a marginaleffects function.

Consider two imputation packages, which can both generate lists of imputed datasets: Amelia and missRanger.

library(Amelia)
library(miceadds)
library(missRanger)

# impute data
dat_amelia <- amelia(dat, noms = "Species", p2s = 0)$imputations
mids_amelia <- datlist2mids(dat_amelia)

# convert lists of imputed datasets to `mids` objects
dat_missRanger <- replicate(20, missRanger(dat, verbose = 0), simplify = FALSE)
mids_missRanger <- datlist2mids(dat_missRanger)

# fit models
mod_amelia <- with(mids_amelia, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))
mod_missRanger <- with(mids_missRanger, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

# `Amelia` slopes
mfx_amelia <- avg_slopes(mod_amelia, by = "Species")
mfx_amelia
#> 
#>          Term                        Contrast    Species Estimate Std. Error   Df      t Pr(>|t|)   2.5 % 97.5 %
#>  Sepal.Length mean(dY/dX)                     setosa       0.3456     0.0935 7.92  3.696  0.00618  0.1296  0.562
#>  Sepal.Length mean(dY/dX)                     virginica    0.3058     0.1020 6.52  2.998  0.02177  0.0609  0.551
#>  Sepal.Length mean(dY/dX)                     versicolor   0.3008     0.0972 6.75  3.095  0.01829  0.0692  0.532
#>  Sepal.Width  mean(dY/dX)                     setosa      -0.1690     0.1693 7.30 -0.998  0.35009 -0.5659  0.228
#>  Sepal.Width  mean(dY/dX)                     virginica   -0.0621     0.1692 6.65 -0.367  0.72495 -0.4665  0.342
#>  Sepal.Width  mean(dY/dX)                     versicolor  -0.0596     0.1510 7.77 -0.395  0.70348 -0.4097  0.290
#>  Species      mean(versicolor) - mean(setosa) setosa       0.6696     0.2063 5.88  3.246  0.01805  0.1624  1.177
#>  Species      mean(versicolor) - mean(setosa) virginica    0.6696     0.2063 5.88  3.246  0.01805  0.1624  1.177
#>  Species      mean(versicolor) - mean(setosa) versicolor   0.6696     0.2063 5.88  3.246  0.01805  0.1624  1.177
#>  Species      mean(virginica) - mean(setosa)  setosa       1.1303     0.2169 6.22  5.211  0.00178  0.6041  1.656
#>  Species      mean(virginica) - mean(setosa)  virginica    1.1303     0.2169 6.22  5.211  0.00178  0.6041  1.656
#>  Species      mean(virginica) - mean(setosa)  versicolor   1.1303     0.2169 6.22  5.211  0.00178  0.6041  1.656
#> 
#> Columns: term, contrast, Species, estimate, std.error, predicted, predicted_hi, predicted_lo, df, statistic, p.value, conf.low, conf.high

# `missRanger` slopes
mfx_missRanger <- avg_slopes(mod_missRanger, by = "Species")
mfx_missRanger
#> 
#>          Term                        Contrast    Species Estimate Std. Error      Df     t Pr(>|t|)    2.5 % 97.5 %
#>  Sepal.Length mean(dY/dX)                     setosa       0.0583     0.0414 3386591  1.41  0.15927 -0.02287  0.139
#>  Sepal.Length mean(dY/dX)                     versicolor   0.0670     0.0392  772141  1.71  0.08745 -0.00984  0.144
#>  Sepal.Length mean(dY/dX)                     virginica    0.0639     0.0368 1300639  1.74  0.08211 -0.00814  0.136
#>  Sepal.Width  mean(dY/dX)                     setosa       0.2314     0.0691 3034569  3.35  < 0.001  0.09600  0.367
#>  Sepal.Width  mean(dY/dX)                     versicolor   0.2188     0.0551 1270189  3.97  < 0.001  0.11091  0.327
#>  Sepal.Width  mean(dY/dX)                     virginica    0.2092     0.0687  625864  3.05  0.00232  0.07457  0.344
#>  Species      mean(versicolor) - mean(setosa) setosa       1.1588     0.0704 4239295 16.45  < 0.001  1.02078  1.297
#>  Species      mean(versicolor) - mean(setosa) versicolor   1.1588     0.0704 4239295 16.45  < 0.001  1.02078  1.297
#>  Species      mean(versicolor) - mean(setosa) virginica    1.1588     0.0704 4239295 16.45  < 0.001  1.02078  1.297
#>  Species      mean(virginica) - mean(setosa)  setosa       1.7784     0.0823 4276524 21.61  < 0.001  1.61710  1.940
#>  Species      mean(virginica) - mean(setosa)  versicolor   1.7784     0.0823 4276524 21.61  < 0.001  1.61710  1.940
#>  Species      mean(virginica) - mean(setosa)  virginica    1.7784     0.0823 4276524 21.61  < 0.001  1.61710  1.940
#> 
#> Columns: term, contrast, Species, estimate, std.error, predicted, predicted_hi, predicted_lo, df, statistic, p.value, conf.low, conf.high

Comparing results with different imputation software

We can use the modelsummary package to compare the results with listwise deletion to the results using different imputations software:

library(modelsummary)

# listwise deletion slopes
mod_lwd <- lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species, data = dat)
mfx_lwd <- avg_slopes(mod_lwd, by = "Species")

# regression table
models <- list(
    "LWD" = mfx_lwd,
    "mice" = mfx_mice,
    "Amelia" = mfx_amelia,
    "missRanger" = mfx_missRanger)
modelsummary(models, shape = term : contrast + Species ~ model)
Species LWD mice Amelia missRanger
Sepal.Length mean(dY/dX) setosa 0.033 0.068 0.346 0.058
setosa (0.061) (0.056) (0.094) (0.041)
versicolor 0.050 0.054 0.301 0.067
versicolor (0.061) (0.056) (0.097) (0.039)
virginica 0.043 0.058 0.306 0.064
virginica (0.058) (0.051) (0.102) (0.037)
Sepal.Width mean(dY/dX) setosa 0.274 0.189 −0.169 0.231
setosa (0.091) (0.084) (0.169) (0.069)
versicolor 0.255 0.209 −0.060 0.219
versicolor (0.074) (0.079) (0.151) (0.055)
virginica 0.234 0.224 −0.062 0.209
virginica (0.083) (0.103) (0.169) (0.069)
Species mean(versicolor) - mean(setosa) setosa 1.157 1.140 0.670 1.159
setosa (0.097) (0.098) (0.206) (0.070)
versicolor 1.157 1.140 0.670 1.159
versicolor (0.097) (0.098) (0.206) (0.070)
virginica 1.157 1.140 0.670 1.159
virginica (0.097) (0.098) (0.206) (0.070)
Species mean(virginica) - mean(setosa) setosa 1.839 1.741 1.130 1.778
setosa (0.123) (0.111) (0.217) (0.082)
versicolor 1.839 1.741 1.130 1.778
versicolor (0.123) (0.111) (0.217) (0.082)
virginica 1.839 1.741 1.130 1.778
virginica (0.123) (0.111) (0.217) (0.082)
Num.Obs. 60 150 150 150
Num.Imp. 20 5 20
R2 0.953 0.930 0.873 0.947
R2 Adj. 0.949 0.928 0.869 0.945
AIC −34.0
BIC −19.3
Log.Lik. 23.997
F 220.780
RMSE 0.16

Passing new data arguments: newdata, wts, by, etc.

Sometimes we want to pass arguments changing or specifying the data on which we will do our analysis using marginaleffects. This can be for reasons such as wanting to specify the values or weights at which we evaluate e.g. avg_slopes, or due to the underlying models not robustly preserving all the original data columns (such as fixest objects not saving their data in the fit object making it potentially challenging to retrieve, and even if retrievable it will not include the weights used during fitting as a column as wts expects when given a string).

If we are not using multiple imputation, or if we want to just pass a single dataset to the several fitted models after multiple imputation, we can pass a single dataset to the newdata argument. However, if we wish to supply each model in our list resulting after multiple imputation with a /different/ dataset on which to calculate results, we cannot use newdata. Instead, in this case it can be useful to revert to a more manual (but still very easy) approach. Here is an example calculating avg_slopes using a different set of weights for each of the fixest models which we fit after multiple imputation.

set.seed(1024)
library(mice)
library(fixest)
library(marginaleffects)

dat <- mtcars

# insert missing values
dat$hp[sample(seq_len(nrow(mtcars)), 10)] <- NA
dat$mpg[sample(seq_len(nrow(mtcars)), 10)] <- NA
dat$gear[sample(seq_len(nrow(mtcars)), 10)] <- NA

# multiple imputation
dat <- mice(dat, m = 5, method = "sample", printFlag = FALSE)
dat <- complete(dat, action = "all")

# fit models
mod <- lapply(dat, \(x) 
    feglm(am ~ mpg * cyl + hp,
        weight = ~gear,
        family = binomial,
        data = x))

# slopes without weights
lapply(seq_along(mod), \(i) 
    avg_slopes(mod[[i]], newdata = dat[[i]])) |>
    mice::pool()
#> Class: mipo    m = 5 
#>   term m     estimate         ubar            b            t dfcom       df      riv    lambda       fmi
#> 1  mpg 5  0.006083006 1.080596e-04 2.722367e-04 4.347437e-04    29 3.458473 3.023184 0.7514407 0.8284122
#> 2  cyl 5 -0.134279397 7.095810e-04 2.347149e-03 3.526159e-03    29 2.921517 3.969355 0.7987666 0.8667335
#> 3   hp 5  0.001649797 5.709181e-07 1.375524e-06 2.221547e-06    29 3.556952 2.891184 0.7430088 0.8213962

# slopes with weights
lapply(seq_along(mod), \(i) 
    avg_slopes(mod[[i]], newdata = dat[[i]], wts = "gear")) |>
    mice::pool()
#> Class: mipo    m = 5 
#>   term m     estimate         ubar            b            t dfcom       df      riv    lambda       fmi
#> 1  mpg 5  0.006251264 1.056051e-04 2.705362e-04 4.302485e-04    29 3.422444 3.074127 0.7545486 0.8309841
#> 2  cyl 5 -0.135839559 7.279608e-04 2.480980e-03 3.705137e-03    29 2.868429 4.089748 0.8035266 0.8704861
#> 3   hp 5  0.001671181 5.697634e-07 1.424665e-06 2.279361e-06    29 3.474829 3.000540 0.7500337 0.8272454