Skip to contents

The code in this vignette requires marginaleffects version 0.6.0 or the development version hosted on Github.

This vignette introduces the hypotheses() function, and the hypothesis argument of the comparisons(), slopes(), and predictions() function. These features allow users to conduct linear and non-linear hypothesis tests and to compute custom contrasts (linear combinations) between parameters.

Null hypothesis

The simplest way to modify a hypothesis test is to change the null hypothesis. By default, all functions in the marginaleffects package assume that the null is 0. This can be changed by changing the hypothesis argument.

For example, consider a logistic regression model:

library(marginaleffects)
mod <- glm(am ~ hp + drat, data = mtcars, family = binomial)

We can compute the predicted outcome for a hypothetical unit where all regressors are fixed to their sample means:

predictions(mod, newdata = "mean")
#> 
#>  Estimate Std. Error     z Pr(>|z|)   2.5 % 97.5 %
#>     0.231     0.1429 1.616  0.10606 0.05842 0.5925
#> 
#> Prediction type:  response 
#> Columns: rowid, type, estimate, std.error, statistic, p.value, conf.low, conf.high, am, hp, drat

The Z statistic and p value reported above assume that the null hypothesis equals zero. We can change the null with the hypothesis argument:

predictions(mod, newdata = "mean", hypothesis = .5)
#> 
#>  Estimate Std. Error      z Pr(>|z|)   2.5 % 97.5 %
#>     0.231     0.1429 -1.883 0.059766 0.05842 0.5925
#> 
#> Prediction type:  response 
#> Columns: rowid, type, estimate, std.error, statistic, p.value, conf.low, conf.high, am, hp, drat

This can obviously be useful in other contexts. For instance, if we compute risk ratios (at the mean) associated with an increase of 1 unit in hp, it makes more sense to test the null hypothesis that the ratio of predictions is 1 rather than 0:

comparisons(
    mod,
    newdata = "mean",
    variables = "hp",
    transform_pre = "ratio",
    hypothesis = 1)
#> 
#>  Term Contrast Estimate Std. Error     z Pr(>|z|)  2.5 % 97.5 %
#>    hp       +1    1.008   0.007891 1.056  0.29085 0.9929  1.024
#> 
#> Prediction type:  response 
#> Columns: rowid, type, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, am, hp, drat, eps

Warning: Z statistics and p values are computed before applying functions in transform_post.

Hypothesis tests with the delta method

Version 0.6.0 of marginaleffects includes powerful function called hypotheses(). This function emulates the behavior of the well-established car::deltaMethod and car::linearHypothesis functions, but it supports more models, requires fewer dependencies, and offers some convenience features like shortcuts for robust standard errors.

hypotheses() can be used to compute estimates and standard errors of arbitrary functions of model parameters. For example, it can be used to conduct tests of equality between coefficients, or to test the value of some linear or non-linear combination of quantities of interest. hypotheses() can also be used to conduct hypothesis tests on other functions of a model’s parameter, such as adjusted predictions or marginal effects.

Let’s start by estimating a simple model:

library(marginaleffects)
mod <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)

When the FUN and hypothesis arguments of hypotheses() equal NULL (the default), the function returns a data.frame of raw estimates:

hypotheses(mod)
#> 
#>  Term Estimate Std. Error      z   Pr(>|z|)    2.5 %     97.5 %
#>    b1 35.84600    2.04102 17.563 < 2.22e-16 31.84567 39.8463192
#>    b2 -0.02312    0.01195 -1.934   0.053069 -0.04655  0.0003061
#>    b3 -3.18140    0.71960 -4.421 9.8215e-06 -4.59180 -1.7710120
#>    b4 -3.35902    1.40167 -2.396   0.016555 -6.10625 -0.6118027
#>    b5 -3.18588    2.17048 -1.468   0.142151 -7.43994  1.0681690
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Test of equality between coefficients:

hypotheses(mod, "hp = wt")
#> 
#>     Term Estimate Std. Error     z   Pr(>|z|) 2.5 % 97.5 %
#>  hp = wt    3.158     0.7199 4.387 1.1489e-05 1.747  4.569
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Non-linear function of coefficients

hypotheses(mod, "exp(hp + wt) = 0.1")
#> 
#>                Term Estimate Std. Error      z Pr(>|z|)   2.5 %    97.5 %
#>  exp(hp + wt) = 0.1 -0.05942     0.0292 -2.035 0.041832 -0.1166 -0.002196
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

The vcov argument behaves in the same was as in the slopes() function. It allows us to easily compute robust standard errors:

hypotheses(mod, "hp = wt", vcov = "HC3")
#> 
#>     Term Estimate Std. Error     z   Pr(>|z|) 2.5 % 97.5 %
#>  hp = wt    3.158     0.8052 3.922 8.7673e-05  1.58  4.736
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

We can use shortcuts like b1, b2, ... to identify the position of each parameter in the output of FUN. For example, b2=b3 is equivalent to hp=wt because those term names appear in the 2nd and 3rd row when we call hypotheses(mod).

hypotheses(mod, "b2 = b3")
#> 
#>     Term Estimate Std. Error     z   Pr(>|z|) 2.5 % 97.5 %
#>  b2 = b3    3.158     0.7199 4.387 1.1489e-05 1.747  4.569
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Term names with special characters must be enclosed in backticks:

hypotheses(mod, "`factor(cyl)6` = `factor(cyl)8`")
#> 
#>                             Term Estimate Std. Error       z Pr(>|z|)  2.5 % 97.5 %
#>  `factor(cyl)6` = `factor(cyl)8`  -0.1731      1.654 -0.1047  0.91663 -3.415  3.068
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

The FUN argument can be used to compute standard errors for arbitrary functions of model parameters. This user-supplied function must accept a single model object, and return a numeric vector or a data.frame with two columns named term and estimate.

mod <- glm(am ~ hp + mpg, data = mtcars, family = binomial)

f <- function(x) predict(x, type = "link", newdata = mtcars)
p <- hypotheses(mod, FUN = f)
head(p)
#> 
#>  Term Estimate Std. Error       z Pr(>|z|)  2.5 %  97.5 %
#>    b1  -1.0984     0.7160 -1.5339 0.125046 -2.502  0.3051
#>    b2  -1.0984     0.7160 -1.5339 0.125046 -2.502  0.3051
#>    b3   0.2332     0.7808  0.2986 0.765211 -1.297  1.7636
#>    b4  -0.5945     0.6471 -0.9187 0.358234 -1.863  0.6738
#>    b5  -0.4176     0.6475 -0.6449 0.518965 -1.687  0.8514
#>    b6  -5.0265     2.1949 -2.2901 0.022018 -9.328 -0.7245
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Test of equality between two predictions (row 2 vs row 3):

f <- function(x) predict(x, newdata = mtcars)
hypotheses(mod, FUN = f, hypothesis = "b2 = b3")
#> 
#>     Term Estimate Std. Error      z Pr(>|z|)  2.5 %  97.5 %
#>  b2 = b3   -1.332     0.6156 -2.163 0.030527 -2.538 -0.1251
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Note that we specified the newdata argument in the f function. This is because the predict() method associated with lm objects will automatically the original fitted values when newdata is NULL, instead of returning the slightly altered fitted values which we need to compute numerical derivatives in the delta method.

We can also use numeric vectors to specify linear combinations of parameters. For example, there are 3 coefficients in the last model we estimated. To test the null hypothesis that the sum of the 2nd and 3rd coefficients is equal to 0, we can do:

hypotheses(mod, hypothesis = c(0, 1, 1))
#> 
#>    Term Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
#>  custom    1.315     0.5927 2.218 0.026551 0.153  2.476
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

See below for more example of how to use string formulas, numeric vectors, or matrices to calculate custom contrasts, linear combinations, and linear or non-linear hypothesis tests.

hypotheses Formulas

Each of the 4 core functions of the package support a hypothesis argument which behaves similarly to the hypotheses() function. This argument allows users to specify custom hypothesis tests and contrasts, in order to test null hypotheses such as:

  • The coefficients \(\beta_1\) and \(\beta_2\) are equal.
  • The marginal effects of \(X_1\) and \(X_2\) equal.
  • The marginal effect of \(X\) when \(W=0\) is equal to the marginal effect of \(X\) when \(W=1\).
  • A non-linear function of adjusted predictions is equal to 100.
  • The marginal mean in the control group is equal to the average of marginal means in the other 3 treatment arms.
  • Cross-level contrasts: In a multinomial model, the effect of \(X\) on the 1st outcome level is equal to the effect of \(X\) on the 2nd outcome level.

Marginal effects

For example, let’s fit a model and compute some marginal effects at the mean:

library(marginaleffects)

mod <- lm(mpg ~ am + vs, data = mtcars)

mfx <- slopes(mod, newdata = "mean")
mfx
#> 
#>  Term Estimate Std. Error     z   Pr(>|z|) 2.5 % 97.5 %
#>    am    6.067      1.275 4.759 1.9479e-06 3.568  8.565
#>    vs    6.929      1.262 5.490 4.0146e-08 4.456  9.403
#> 
#> Prediction type:  response 
#> Columns: rowid, type, term, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, mpg, am, vs, eps

Is the marginal effect of am different from the marginal effect of vs? To answer this question we can run a linear hypothesis test using the hypotheses function:

hypotheses(mfx, hypothesis = "am = vs")
#> 
#>   Term Estimate Std. Error       z Pr(>|z|)  2.5 % 97.5 %
#>  am=vs  -0.8627      1.939 -0.4449  0.65639 -4.663  2.938
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

Alternatively, we can specify the hypothesis directly in the original call:

library(marginaleffects)

mod <- lm(mpg ~ am + vs, data = mtcars)

slopes(
    mod,
    newdata = "mean",
    hypothesis = "am = vs")
#> 
#>   Term Estimate Std. Error       z Pr(>|z|)  2.5 % 97.5 %
#>  am=vs  -0.8627      1.939 -0.4449  0.65639 -4.663  2.938
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

The hypotheses string can include any valid R expression, so we can run some silly non-linear tests:

slopes(
    mod,
    newdata = "mean",
    hypothesis = "exp(am) - 2 * vs = -400")
#> 
#>               Term Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
#>  exp(am)-2*vs=-400    817.4      550.2 1.486   0.1374  -261   1896
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

Adjusted Predictions

Now consider the case of adjusted predictions:

p <- predictions(
    mod,
    newdata = datagrid(am = 0:1, vs = 0:1))
p
#> 
#>  Estimate Std. Error     z   Pr(>|z|) 2.5 % 97.5 % am vs
#>     14.59     0.9262 15.76 < 2.22e-16 12.78  16.41  0  0
#>     21.52     1.1300 19.05 < 2.22e-16 19.31  23.74  0  1
#>     20.66     1.1830 17.46 < 2.22e-16 18.34  22.98  1  0
#>     27.59     1.1300 24.42 < 2.22e-16 25.38  29.81  1  1
#> 
#> Prediction type:  response 
#> Columns: rowid, type, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, am, vs

Since there is no term column in the output of the predictions function, we must use parameter identifiers like b1, b2, etc. to determine which estimates we want to compare:

hypotheses(p, hypothesis = "b1 = b2")
#> 
#>   Term Estimate Std. Error     z   Pr(>|z|)  2.5 % 97.5 %
#>  b1=b2   -6.929      1.262 -5.49 4.0146e-08 -9.403 -4.456
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

Or directly:

predictions(
    mod,
    hypothesis = "b1 = b2",
    newdata = datagrid(am = 0:1, vs = 0:1))
#> 
#>   Term Estimate Std. Error     z   Pr(>|z|)  2.5 % 97.5 %
#>  b1=b2   -6.929      1.262 -5.49 4.0146e-08 -9.403 -4.456
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

p$estimate[1] - p$estimate[2]
#> [1] -6.929365

In the next section, we will see that we can get equivalent results by using a vector of contrast weights, which will be used to compute a linear combination of estimates:

predictions(
    mod,
    hypothesis = c(1, -1, 0, 0),
    newdata = datagrid(am = 0:1, vs = 0:1))
#> 
#>    Term Estimate Std. Error     z   Pr(>|z|)  2.5 % 97.5 %
#>  custom   -6.929      1.262 -5.49 4.0146e-08 -9.403 -4.456
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

There are many more possibilities:

predictions(
    mod,
    hypothesis = "b1 + b2 = 30",
    newdata = datagrid(am = 0:1, vs = 0:1))
#> 
#>      Term Estimate Std. Error    z   Pr(>|z|) 2.5 % 97.5 %
#>  b1+b2=30    6.118      1.636 3.74 0.00018417 2.912  9.325
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

p$estimate[1] + p$estimate[2] - 30
#> [1] 6.118254

predictions(
    mod,
    hypothesis = "(b2 - b1) / (b3 - b2) = 0",
    newdata = datagrid(am = 0:1, vs = 0:1))
#> 
#>               Term Estimate Std. Error       z Pr(>|z|)  2.5 % 97.5 %
#>  (b2-b1)/(b3-b2)=0   -8.032      16.97 -0.4734  0.63591 -41.29  25.22
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

Average contrasts or marginal effects

The standard workflow with the marginaleffects package is to call a function like predictions(), slopes() or comparisons() to compute unit-level quantities; or one of their cousins avg_predictions(), avg_comparisons(), or avg_slopes() to aggregate the unit-level quantities into “Average Marginal Effects” or “Average Contrasts.” We can also use the transform_pre argument to emulate the behavior of the avg_*() functions.

First, note that these three commands produce the same results:

comparisons(mod, variables = "vs")$estimate |> mean()
#> [1] 6.929365

avg_comparisons(mod, variables = "vs")
#> 
#>  Term Contrast Estimate Std. Error    z   Pr(>|z|) 2.5 % 97.5 %
#>    vs    1 - 0    6.929      1.262 5.49 4.0146e-08 4.456  9.403
#> 
#> Prediction type:  response 
#> Columns: type, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high

comparisons(
    mod,
    variables = "vs",
    transform_pre = "differenceavg")
#> 
#>  Term          Contrast Estimate Std. Error    z   Pr(>|z|) 2.5 % 97.5 %
#>    vs mean(1) - mean(0)    6.929      1.262 5.49 4.0146e-08 4.456  9.403
#> 
#> Prediction type:  response 
#> Columns: type, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo

See the transformations section of the Contrasts vignette for more details.

With these results in hand, we can now conduct a linear hypothesis test between average marginal effects:

comparisons(
    mod,
    hypothesis = "am = vs",
    transform_pre = "differenceavg")
#> 
#>   Term Estimate Std. Error       z Pr(>|z|)  2.5 % 97.5 %
#>  am=vs  -0.8627      1.939 -0.4449  0.65639 -4.663  2.938
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

Computing contrasts between average marginal effects requires a little care to obtain the right scale. In particular, we need to specify both the variables and the transform_pre:

comparisons(
    mod,
    hypothesis = "am = vs",
    variables = c("am", "vs"),
    transform_pre = "dydxavg")
#> 
#>   Term Estimate Std. Error       z Pr(>|z|)  2.5 % 97.5 %
#>  am=vs  -0.8627      1.939 -0.4449  0.65639 -4.663  2.938
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

hypotheses Vectors and Matrices

The marginal_means() function computes estimated marginal means. The hypothesis argument of that function offers a powerful mechanism to estimate custom contrasts between marginal means, by way of linear combination.

Consider a simple example:

library(marginaleffects)
library(emmeans)
library(nnet)

dat <- mtcars
dat$carb <- factor(dat$carb)
dat$cyl <- factor(dat$cyl)
dat$am <- as.logical(dat$am)

mod <- lm(mpg ~ carb + cyl, dat)
mm <- marginal_means(mod, variables = "carb")
mm
#> 
#>  Term Value  Mean Std. Error      z   Pr(>|z|) 2.5 % 97.5 %
#>  carb     1 21.66      1.438 15.060 < 2.22e-16 18.84  24.48
#>  carb     2 21.34      1.235 17.285 < 2.22e-16 18.92  23.76
#>  carb     3 21.42      2.192  9.772 < 2.22e-16 17.12  25.71
#>  carb     4 18.88      1.211 15.595 < 2.22e-16 16.51  21.26
#>  carb     6 19.76      3.551  5.564 2.6299e-08 12.80  26.72
#>  carb     8 20.12      3.510  5.731 9.9847e-09 13.24  27.00
#> 
#> Prediction type:  response 
#> Results averaged over levels of: carb, cyl 
#> Columns: type, term, value, carb, estimate, std.error, statistic, p.value, conf.low, conf.high

The contrast between marginal means for carb==1 and carb==2 is:

21.66232 - 21.34058 
#> [1] 0.32174

or

21.66232 + -(21.34058)
#> [1] 0.32174

or

sum(c(21.66232, 21.34058) * c(1, -1))
#> [1] 0.32174

or

c(21.66232, 21.34058) %*% c(1, -1)
#>         [,1]
#> [1,] 0.32174

The last two commands express the contrast of interest as a linear combination of marginal means.

Simple contrast

In the marginal_means() function, we can supply a hypothesis argument to compute linear combinations of marginal means. This argument must be a numeric vector of the same length as the number of rows in the output of marginal_means(). For example, in the previous there were six rows, and the two marginal means we want to compare are at in the first two positions:

lc <- c(1, -1, 0, 0, 0, 0)
marginal_means(mod, variables = "carb", hypothesis = lc)
#> 
#>    Term   Mean Std. Error      z Pr(>|z|)  2.5 % 97.5 %
#>  custom 0.3217      1.774 0.1814  0.85606 -3.155  3.798
#> 
#> Prediction type:  response 
#> Results averaged over levels of: carb, cyl 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Complex contrast

Of course, we can also estimate more complex contrasts:

lc <- c(-2, 1, 1, 0, -1, 1)
marginal_means(mod, variables = "carb", hypothesis = lc)
#> 
#>    Term    Mean Std. Error        z Pr(>|z|)  2.5 % 97.5 %
#>  custom -0.2109      6.929 -0.03043  0.97572 -13.79  13.37
#> 
#> Prediction type:  response 
#> Results averaged over levels of: carb, cyl 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

emmeans produces similar results:

library(emmeans)
em <- emmeans(mod, "carb")
lc <- data.frame(custom_contrast = c(-2, 1, 1, 0, -1, 1))
contrast(em, method = lc)
#>  contrast        estimate   SE df t.ratio p.value
#>  custom_contrast   -0.211 6.93 24  -0.030  0.9760
#> 
#> Results are averaged over the levels of: cyl

Multiple contrasts

Users can also compute multiple linear combinations simultaneously by supplying a numeric matrix to hypotheses. This matrix must have the same number of rows as the output of slopes(), and each column represents a distinct set of weights for different linear combinations. The column names of the matrix become labels in the output. For example:

lc <- matrix(c(
    -2, 1, 1, 0, -1, 1,
    1, -1, 0, 0, 0, 0
    ), ncol = 2)
colnames(lc) <- c("Contrast A", "Contrast B")
lc
#>      Contrast A Contrast B
#> [1,]         -2          1
#> [2,]          1         -1
#> [3,]          1          0
#> [4,]          0          0
#> [5,]         -1          0
#> [6,]          1          0

marginal_means(mod, variables = "carb", hypothesis = lc)
#> 
#>        Term    Mean Std. Error        z Pr(>|z|)   2.5 % 97.5 %
#>  Contrast A -0.2109      6.929 -0.03043  0.97572 -13.791 13.369
#>  Contrast B  0.3217      1.774  0.18139  0.85606  -3.155  3.798
#> 
#> Prediction type:  response 
#> Results averaged over levels of: carb, cyl 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Contrasts across response levels

In models with multinomial outcomes, one may be interested in comparing outcomes or contrasts across response levels. For example, in this model there are 18 estimated marginal means, across 6 outcome levels (the group column):

library(nnet)
mod <- multinom(carb ~ mpg + cyl, data = dat, trace = FALSE)
mm <- marginal_means(mod, type = "probs")
mm
#> 
#>  Group Term Value      Mean Std. Error
#>      1  cyl     4 3.679e-01  2.598e-01
#>      2  cyl     4 6.311e-01  2.601e-01
#>      3  cyl     4 6.853e-04  1.192e-02
#>      4  cyl     4 2.119e-04  1.750e-02
#>      6  cyl     4 8.815e-06  8.407e-05
#>      8  cyl     4 1.497e-04  7.970e-03
#>      1  cyl     6 2.829e-01  1.957e-01
#>      2  cyl     6 1.849e-06  2.395e-06
#>      3  cyl     6 1.122e-05  1.323e-03
#>      4  cyl     6 5.563e-01  2.185e-01
#>      6  cyl     6 1.608e-01  1.540e-01
#>      8  cyl     6 9.292e-06  7.982e-04
#>      1  cyl     8 4.625e-04  9.587e-03
#>      2  cyl     8 6.655e-01  3.745e-01
#>      3  cyl     8 3.104e-01  3.715e-01
#>      4  cyl     8 9.582e-03  2.289e-02
#>      6  cyl     8 4.346e-09  9.895e-08
#>      8  cyl     8 1.412e-02  4.673e-02
#> 
#> Prediction type:  probs 
#> Results averaged over levels of: cyl 
#> Columns: type, group, term, value, cyl, estimate, std.error

Let’s contrast the marginal means in the first outcome level when cyl equals 4 and 6. These marginal means are located in rows 1 and 7 respectively:

lc <- rep(0, nrow(mm))
lc[1] <- -1
lc[7] <- 1
marginal_means(
    mod,
    type = "probs",
    hypothesis = lc)
#> 
#>    Term     Mean Std. Error
#>  custom -0.08498     0.3207
#> 
#> Prediction type:  probs 
#> Results averaged over levels of: cyl 
#> Columns: term, estimate, std.error

This is indeed equal to the results we would have obtained manually:

2.828726e-01 - 3.678521e-01
#> [1] -0.0849795

Now let’s say we want to calculate a “contrast in contrasts”, that is, the outcome of a 3-step process:

  1. Contrast between cyl=6 and cyl=4 in the 1st outcome level
  2. Contrast between cyl=6 and cyl=4 in the 2nd outcome level
  3. Contrast between the contrasts defined in steps 1 and 2.

We create the linear combination weights as follows:

lc <- rep(0, nrow(mm))
lc[c(1, 8)] <- -1
lc[c(7, 2)] <- 1

To make sure that the weights are correct, we can display them side by side with the original marginal_means() output:

transform(mm[, 1:3], lc = lc)
#>     type group term lc
#> 1  probs     1  cyl -1
#> 2  probs     2  cyl  1
#> 3  probs     3  cyl  0
#> 4  probs     4  cyl  0
#> 5  probs     6  cyl  0
#> 6  probs     8  cyl  0
#> 7  probs     1  cyl  1
#> 8  probs     2  cyl -1
#> 9  probs     3  cyl  0
#> 10 probs     4  cyl  0
#> 11 probs     6  cyl  0
#> 12 probs     8  cyl  0
#> 13 probs     1  cyl  0
#> 14 probs     2  cyl  0
#> 15 probs     3  cyl  0
#> 16 probs     4  cyl  0
#> 17 probs     6  cyl  0
#> 18 probs     8  cyl  0

Compute the results:

marginal_means(mod, type = "probs", hypothesis = lc)
#> 
#>    Term   Mean Std. Error
#>  custom 0.5461     0.5498
#> 
#> Prediction type:  probs 
#> Results averaged over levels of: cyl 
#> Columns: term, estimate, std.error

Pairwise contrasts: Difference-in-Differences

Now we illustrate how to use the machinery described above to do pairwise comparisons between contrasts, a type of analysis often associated with a “Difference-in-Differences” research design.

First, we simulate data with two treatment groups and pre/post periods:

library(data.table)

N <- 1000
did <- data.table(
    id = 1:N,
    pre = rnorm(N),
    trt = sample(0:1, N, replace = TRUE))
did$post <- did$pre + did$trt * 0.3 + rnorm(N)
did <- melt(
    did,
    value.name = "y",
    variable.name = "time",
    id.vars = c("id", "trt"))
head(did)
#>    id trt time          y
#> 1:  1   0  pre -0.3065788
#> 2:  2   1  pre -1.1005951
#> 3:  3   1  pre -1.1808039
#> 4:  4   1  pre -0.3950081
#> 5:  5   0  pre -0.5322512
#> 6:  6   0  pre  0.4174413

Then, we estimate a linear model with a multiple interaction between the time and the treatment indicators. We also compute contrasts at the mean for each treatment level:

did_model <- lm(y ~ time * trt, data = did)

comparisons(
    did_model,
    newdata = datagrid(trt = 0:1),
    variables = "time")
#> 
#>  Term   Contrast Estimate Std. Error      z   Pr(>|z|)   2.5 % 97.5 % trt
#>  time post - pre  0.02833    0.07812 0.3626 0.71692913 -0.1248 0.1814   0
#>  time post - pre  0.26204    0.07828 3.3475 0.00081539  0.1086 0.4155   1
#> 
#> Prediction type:  response 
#> Columns: rowid, type, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, y, time, trt

Finally, we compute pairwise differences between contrasts. This is the Diff-in-Diff estimate:

comparisons(
    did_model,
    variables = "time",
    newdata = datagrid(trt = 0:1),
    hypothesis = "pairwise")
#> 
#>           Term Estimate Std. Error      z Pr(>|z|)   2.5 %   97.5 %
#>  Row 1 - Row 2  -0.2337     0.1106 -2.113 0.034575 -0.4505 -0.01696
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

Equivalence, non-inferiority, and non-superiority tests

In many contexts, analysts are less interested in rejecting a null hypothesis, and more interested in testing whether an estimate is “inferior”, “superior”, or “equivalent” to a given threshold or interval. For example, medical researchers may wish to determine if the estimated effect of a new treatment is larger than the effect of prior treatments, or larger than some threshold of “clinical significance.” Alternatively, researchers may wish to support a claim that an estimated parameter is “equivalent to” or “not meaningfully different from” a null hypothesis.

To answer these questions, we can use non-inferiority, non-superiority, or equivalence tests like the two-one-sided test (TOST). This article gives a primer and tutorial on TOST:

Lakens D, Scheel AM, Isager PM. Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science. 2018;1(2):259-269. doi:10.1177/2515245918770963

The hypotheses() function of the marginaleffects package includes an equivalence argument which allows users to apply these tests to any of the quantities generated by the package, as well as to arbitrary functions of a model’s parameters. To illustrate, we begin by estimating a simple linear regression model:

mod <- lm(mpg ~ hp + factor(gear), data = mtcars)

The rest of this section considers several quantities estimated by marginaleffects.

Example 1: Predictions

Consider a single prediction, where all predictors are held at their median or mode:

p <- predictions(mod, newdata = "median")
p
#> 
#>  Estimate Std. Error     z   Pr(>|z|) 2.5 % 97.5 %
#>     19.66      1.004 19.59 < 2.22e-16 17.69  21.63
#> 
#> Prediction type:  response 
#> Columns: rowid, type, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, hp, gear

Now we specify an equivalence interval (or “region”) for predictions between 17 and 18:

hypotheses(p, equivalence = c(17, 18))
#> 
#>  Estimate Std. Error     z   Pr(>|z|) 2.5 % 97.5 %     p (Inf)   p (Sup)    p (Eq)
#>     19.66      1.004 19.59 < 2.22e-16 17.69  21.63 0.004037228 0.9508009 0.9508009
#> 
#> Prediction type:  response 
#> Columns: rowid, type, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, hp, gear, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

The results allow us to draw three conclusions:

  1. The p value for the non-inferiority test is 0.0040. This suggests that we can reject the null hypothesis that the parameter is below 17.
  2. The p value for the non-superiority test is 0.9508. This suggests that we cannot reject the null hypothesis that the parameter (19.6589) is above 18.
  3. The p value for the equivalence test is 0.9508. This suggests that we cannot reject the hypothesis that the parameter falls outside the equivalence interval.
Example 2: Model coefficients

The hypotheses function also allows users to conduct equivalence, non-inferiority, and non-superiority tests for model coefficients, and for arbitrary functions of model coefficients.

Our estimate of the 4th coefficient in the model is:

coef(mod)[4]
#> factor(gear)5 
#>      6.574763

We can test if this parameter is likely to fall in the [5,7] interval by:

hypotheses(mod, equivalence = c(5, 7))[4, ]
#> 
#>  Term Estimate Std. Error     z   Pr(>|z|) 2.5 % 97.5 %   p (Inf)   p (Sup)    p (Eq)
#>    b4    6.575      1.643 4.002 6.2689e-05 3.355  9.794 0.1688669 0.3978688 0.3978688
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

The p value is 0.3979, so we cannot reject the hypothesis that the factor(gear)5 parameter falls outside the [5,7] interval.

Example 3: Slopes

The same syntax can be used to conduct tests for all the quantities produced by the marginaleffects package. For example, imagine that, for substantive or theoretical reasons, an average slope between -0.1 and 0.1 is uninteresting. We can conduct an equivalence test to check if this is the case:

avg_slopes(mod, variables = "hp") |>
    hypotheses(equivalence = c(-.1, .1))
#> 
#>  Term Estimate Std. Error      z   Pr(>|z|)   2.5 %  97.5 %    p (Inf)      p (Sup)     p (Eq)
#>    hp -0.06685    0.01105 -6.052 1.4269e-09 -0.0885 -0.0452 0.00134666 7.442542e-52 0.00134666
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

The p value is 0.0013, which suggests that we can reject the hypothesis that the parameter falls outside the region of “substantive equivalence” that we have defined by the interval.

Example 4: Difference between comparisons (contrasts)

Consider a model with a multiplicative interaction:

int <- lm(mpg ~ hp * factor(gear), data = mtcars)

The average contrast for a change of 1 unit in hp differs based on the value of gear:

avg_comparisons(int, variables = "hp", by = "gear")
#> 
#>  Term Contrast gear Estimate Std. Error      z   Pr(>|z|)    2.5 %   97.5 %
#>    hp mean(+1)    4 -0.17919    0.03025 -5.923 3.1628e-09 -0.23848 -0.11989
#>    hp mean(+1)    3 -0.05224    0.01456 -3.588 0.00033337 -0.08078 -0.02370
#>    hp mean(+1)    5 -0.05827    0.01263 -4.613 3.9749e-06 -0.08303 -0.03351
#> 
#> Prediction type:  response 
#> Columns: type, term, contrast, gear, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo

Are these contrasts different from one another? Let’s look at the pairwise differences between them:

avg_comparisons(int, variables = "hp", by = "gear", hypothesis = "pairwise")
#> 
#>                           Term  Estimate Std. Error       z   Pr(>|z|)    2.5 %   97.5 %
#>  4,hp,mean(+1) - 3,hp,mean(+1) -0.126946    0.03357 -3.7810 0.00015618 -0.19275 -0.06114
#>  4,hp,mean(+1) - 5,hp,mean(+1) -0.120917    0.03278 -3.6882 0.00022581 -0.18517 -0.05666
#>  3,hp,mean(+1) - 5,hp,mean(+1)  0.006029    0.01928  0.3128 0.75445655 -0.03175  0.04381
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high

We consider that these pairwise comparisons are “equivalent to zero” when they fall in the [-.1, .1] interval:

avg_comparisons(int, variables = "hp", by = "gear", hypothesis = "pairwise") |>
    hypotheses(equivalence = c(-.1, .1))
#> 
#>                           Term  Estimate Std. Error       z   Pr(>|z|)    2.5 %   97.5 %      p (Inf)      p (Sup)
#>  4,hp,mean(+1) - 3,hp,mean(+1) -0.126946    0.03357 -3.7810 0.00015618 -0.19275 -0.06114 7.888916e-01 6.924195e-12
#>  4,hp,mean(+1) - 5,hp,mean(+1) -0.120917    0.03278 -3.6882 0.00022581 -0.18517 -0.05666 7.382704e-01 8.003656e-12
#>  3,hp,mean(+1) - 5,hp,mean(+1)  0.006029    0.01928  0.3128 0.75445655 -0.03175  0.04381 1.893635e-08 5.441587e-07
#>        p (Eq)
#>  7.888916e-01
#>  7.382704e-01
#>  5.441587e-07
#> 
#> Prediction type:  response 
#> Columns: type, term, estimate, std.error, statistic, p.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

The p (Eq.) column shows that the difference between the average contrasts when gear is 3 and gear is 5 can be said to be equivalent to the specified interval. However, there are good reasons to think that the other two pairwise comparisons may fall outside the interval.

Example 5: Marginal means and emmeans

This example shows the equivalence between results produced by the emmeans package and the marginal_means() function:

library(emmeans)

mod <- lm(log(conc) ~ source + factor(percent), data = pigs)

# {emmeans}
emmeans(mod, specs = "source") |>
    pairs() |>
    test(df = Inf,
         null = 0,
         delta = log(1.25),
         side = "equivalence",
         adjust = "none")
#>  contrast    estimate     SE  df z.ratio p.value
#>  fish - soy    -0.273 0.0529 Inf   0.937  0.8257
#>  fish - skim   -0.402 0.0542 Inf   3.308  0.9995
#>  soy - skim    -0.130 0.0530 Inf  -1.765  0.0388
#> 
#> Results are averaged over the levels of: percent 
#> Degrees-of-freedom method: user-specified 
#> Results are given on the log (not the response) scale. 
#> Statistics are tests of equivalence with a threshold of 0.22314 
#> P values are left-tailed

# {marginaleffects}
marginal_means(
    mod,
    variables = "source",
    hypothesis = "pairwise") |>
    hypotheses(equivalence = c(-log(1.25), log(1.25)))
#> 
#>         Term    Mean Std. Error      z   Pr(>|z|)   2.5 %   97.5 %     p (Eq)    p (Inf)      p (Sup)
#>   fish - soy -0.2728    0.05293 -5.153 2.5645e-07 -0.3765 -0.16902 0.82574055 0.82574055 3.682126e-21
#>  fish - skim -0.4023    0.05416 -7.428 1.1052e-13 -0.5084 -0.29613 0.99952941 0.99952941 3.786224e-31
#>   soy - skim -0.1295    0.05304 -2.442   0.014622 -0.2335 -0.02555 0.03876101 0.03876101 1.480788e-11
#> 
#> Prediction type:  response 
#> Results averaged over levels of: source, percent 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, p.value.equiv, p.value.noninf, p.value.nonsup, statistic.noninf, statistic.nonsup
Example 6: t-test

Now we show that the results produced by hypotheses() are identical to the results produced by the equivalence package in the case of a simple t-test:

library(equivalence)

set.seed(1024)

# simulate data data
N <- 20
dat <- data.frame(
    y = rnorm(N),
    x = sample(c(rep(0, N / 2), rep(1, N / 2)), N))

# fit model
mod <- lm(y ~ x, data = dat)

# test with the {equivalence} package
e <- tost(
    x = dat$y[dat$x == 0],
    y = dat$y[dat$x == 1],
    epsilon = 10)
e
#> 
#>  Welch Two Sample TOST
#> 
#> data:  dat$y[dat$x == 0] and dat$y[dat$x == 1]
#> df = 17.607
#> sample estimates:
#>  mean of x  mean of y 
#> -0.3788551 -0.2724594 
#> 
#> Epsilon: 10 
#> 95 percent two one-sided confidence interval (TOST interval):
#>  -1.058539  0.845747
#> Null hypothesis of statistical difference is: rejected 
#> TOST p-value: 4.248528e-13

# test with {marginaleffects} package
hypotheses(mod, equivalence = c(-10, 10), df = e$parameter)[2,]
#> 
#>  Term Estimate Std. Error     z Pr(>|z|)  2.5 % 97.5 %      p (Inf)      p (Sup)       p (Eq)
#>    b2   0.1064     0.5484 0.194  0.84839 -1.048   1.26 2.973612e-13 4.248528e-13 4.248528e-13
#> 
#> Prediction type:  
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, df, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv