Skip to contents

The code in this vignette requires marginaleffects version 0.6.0 or the development version hosted on Github.

This vignette introduces the hypotheses() function, and the hypothesis argument of the comparisons(), slopes(), and predictions() function. These features allow users to conduct linear and non-linear hypothesis tests and to compute custom contrasts (linear combinations) between parameters.

Null hypothesis

The simplest way to modify a hypothesis test is to change the null hypothesis. By default, all functions in the marginaleffects package assume that the null is 0. This can be changed by changing the hypothesis argument.

For example, consider a logistic regression model:

library(marginaleffects)
mod <- glm(am ~ hp + drat, data = mtcars, family = binomial)

We can compute the predicted outcome for a hypothetical unit where all regressors are fixed to their sample means:

predictions(mod, newdata = "mean")
#> 
#>  Estimate Pr(>|z|)  2.5 % 97.5 %  hp drat
#>     0.231    0.135 0.0584  0.592 147  3.6
#> 
#> Columns: rowid, estimate, p.value, conf.low, conf.high, am, hp, drat

The Z statistic and p value reported above assume that the null hypothesis equals zero. We can change the null with the hypothesis argument:

predictions(mod, newdata = "mean", hypothesis = .5)
#> 
#>  Estimate Pr(>|z|)  2.5 % 97.5 %  hp drat
#>     0.231   0.0343 0.0584  0.592 147  3.6
#> 
#> Columns: rowid, estimate, p.value, conf.low, conf.high, am, hp, drat

This can obviously be useful in other contexts. For instance, if we compute risk ratios (at the mean) associated with an increase of 1 unit in hp, it makes more sense to test the null hypothesis that the ratio of predictions is 1 rather than 0:

comparisons(
    mod,
    newdata = "mean",
    variables = "hp",
    comparison = "ratio",
    hypothesis = 1) |>
    print(digits = 3)
#> 
#>  Term Contrast Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>    hp       +1     1.01    0.00793 1.05    0.293 0.993   1.02
#> 
#> Columns: rowid, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, am, hp, drat

Warning: Z statistics and p values are computed before applying functions in transform.

Hypothesis tests with the delta method

Version 0.6.0 of marginaleffects includes powerful function called hypotheses(). This function emulates the behavior of the well-established car::deltaMethod and car::linearHypothesis functions, but it supports more models, requires fewer dependencies, and offers some convenience features like shortcuts for robust standard errors.

hypotheses() can be used to compute estimates and standard errors of arbitrary functions of model parameters. For example, it can be used to conduct tests of equality between coefficients, or to test the value of some linear or non-linear combination of quantities of interest. hypotheses() can also be used to conduct hypothesis tests on other functions of a model’s parameter, such as adjusted predictions or marginal effects.

Let’s start by estimating a simple model:

library(marginaleffects)
mod <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)

When the FUN and hypothesis arguments of hypotheses() equal NULL (the default), the function returns a data.frame of raw estimates:

hypotheses(mod)
#> 
#>  Term Estimate Std. Error     z Pr(>|z|)   2.5 %    97.5 %
#>    b1  35.8460      2.041 17.56   <0.001 31.8457 39.846319
#>    b2  -0.0231      0.012 -1.93   0.0531 -0.0465  0.000306
#>    b3  -3.1814      0.720 -4.42   <0.001 -4.5918 -1.771012
#>    b4  -3.3590      1.402 -2.40   0.0166 -6.1062 -0.611803
#>    b5  -3.1859      2.170 -1.47   0.1422 -7.4399  1.068169
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Test of equality between coefficients:

hypotheses(mod, "hp = wt")
#> 
#>     Term Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>  hp = wt     3.16       0.72 4.39   <0.001  1.75   4.57
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Non-linear function of coefficients

hypotheses(mod, "exp(hp + wt) = 0.1")
#> 
#>                Term Estimate Std. Error     z Pr(>|z|)  2.5 %  97.5 %
#>  exp(hp + wt) = 0.1  -0.0594     0.0292 -2.04   0.0418 -0.117 -0.0022
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

The vcov argument behaves in the same was as in the slopes() function. It allows us to easily compute robust standard errors:

hypotheses(mod, "hp = wt", vcov = "HC3")
#> 
#>     Term Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>  hp = wt     3.16      0.805 3.92   <0.001  1.58   4.74
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

We can use shortcuts like b1, b2, ... to identify the position of each parameter in the output of FUN. For example, b2=b3 is equivalent to hp=wt because those term names appear in the 2nd and 3rd row when we call hypotheses(mod).

hypotheses(mod, "b2 = b3")
#> 
#>     Term Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>  b2 = b3     3.16       0.72 4.39   <0.001  1.75   4.57
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Term names with special characters must be enclosed in backticks:

hypotheses(mod, "`factor(cyl)6` = `factor(cyl)8`")
#> 
#>                             Term Estimate Std. Error      z Pr(>|z|) 2.5 % 97.5 %
#>  `factor(cyl)6` = `factor(cyl)8`   -0.173       1.65 -0.105    0.917 -3.41   3.07
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

The FUN argument can be used to compute standard errors for arbitrary functions of model parameters. This user-supplied function must accept a single model object, and return a numeric vector or a data.frame with two columns named term and estimate.

mod <- glm(am ~ hp + mpg, data = mtcars, family = binomial)

f <- function(x) predict(x, type = "link", newdata = mtcars)
p <- hypotheses(mod, FUN = f)
head(p)
#> 
#>  Term Estimate Std. Error      z Pr(>|z|) 2.5 % 97.5 %
#>    b1   -1.098      0.716 -1.534    0.125 -2.50  0.305
#>    b2   -1.098      0.716 -1.534    0.125 -2.50  0.305
#>    b3    0.233      0.781  0.299    0.765 -1.30  1.764
#>    b4   -0.595      0.647 -0.919    0.358 -1.86  0.674
#>    b5   -0.418      0.647 -0.645    0.519 -1.69  0.851
#>    b6   -5.026      2.195 -2.290    0.022 -9.33 -0.725
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Test of equality between two predictions (row 2 vs row 3):

f <- function(x) predict(x, newdata = mtcars)
hypotheses(mod, FUN = f, hypothesis = "b2 = b3")
#> 
#>     Term Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
#>  b2 = b3    -1.33      0.616 -2.16   0.0305 -2.54 -0.125
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Note that we specified the newdata argument in the f function. This is because the predict() method associated with lm objects will automatically the original fitted values when newdata is NULL, instead of returning the slightly altered fitted values which we need to compute numerical derivatives in the delta method.

We can also use numeric vectors to specify linear combinations of parameters. For example, there are 3 coefficients in the last model we estimated. To test the null hypothesis that the sum of the 2nd and 3rd coefficients is equal to 0, we can do:

hypotheses(mod, hypothesis = c(0, 1, 1))
#> 
#>    Term Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>  custom     1.31      0.593 2.22   0.0266 0.153   2.48
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

See below for more example of how to use string formulas, numeric vectors, or matrices to calculate custom contrasts, linear combinations, and linear or non-linear hypothesis tests.

hypotheses Formulas

Each of the 4 core functions of the package support a hypothesis argument which behaves similarly to the hypotheses() function. This argument allows users to specify custom hypothesis tests and contrasts, in order to test null hypotheses such as:

  • The coefficients \(\beta_1\) and \(\beta_2\) are equal.
  • The marginal effects of \(X_1\) and \(X_2\) equal.
  • The marginal effect of \(X\) when \(W=0\) is equal to the marginal effect of \(X\) when \(W=1\).
  • A non-linear function of adjusted predictions is equal to 100.
  • The marginal mean in the control group is equal to the average of marginal means in the other 3 treatment arms.
  • Cross-level contrasts: In a multinomial model, the effect of \(X\) on the 1st outcome level is equal to the effect of \(X\) on the 2nd outcome level.

Marginal effects

For example, let’s fit a model and compute some marginal effects at the mean:

library(marginaleffects)

mod <- lm(mpg ~ am + vs, data = mtcars)

mfx <- slopes(mod, newdata = "mean")
mfx
#> 
#>  Term Contrast Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>    am    1 - 0     6.07       1.27 4.76   <0.001  3.57   8.57
#>    vs    1 - 0     6.93       1.26 5.49   <0.001  4.46   9.40
#> 
#> Columns: rowid, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, mpg, am, vs

Is the marginal effect of am different from the marginal effect of vs? To answer this question we can run a linear hypothesis test using the hypotheses function:

hypotheses(mfx, hypothesis = "am = vs")
#> 
#>   Term Estimate Std. Error      z Pr(>|z|) 2.5 % 97.5 %
#>  am=vs   -0.863       1.94 -0.445    0.656 -4.66   2.94
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Alternatively, we can specify the hypothesis directly in the original call:

library(marginaleffects)

mod <- lm(mpg ~ am + vs, data = mtcars)

slopes(
    mod,
    newdata = "mean",
    hypothesis = "am = vs")
#> 
#>   Term Estimate Std. Error      z Pr(>|z|) 2.5 % 97.5 %
#>  am=vs   -0.863       1.94 -0.445    0.656 -4.66   2.94
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

The hypotheses string can include any valid R expression, so we can run some silly non-linear tests:

slopes(
    mod,
    newdata = "mean",
    hypothesis = "exp(am) - 2 * vs = -400")
#> 
#>               Term Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>  exp(am)-2*vs=-400      817        550 1.49    0.137  -261   1896
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Adjusted Predictions

Now consider the case of adjusted predictions:

p <- predictions(
    mod,
    newdata = datagrid(am = 0:1, vs = 0:1))
p
#> 
#>  Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 % am vs
#>      14.6      0.926 15.8   <0.001  12.8   16.4  0  0
#>      21.5      1.130 19.0   <0.001  19.3   23.7  0  1
#>      20.7      1.183 17.5   <0.001  18.3   23.0  1  0
#>      27.6      1.130 24.4   <0.001  25.4   29.8  1  1
#> 
#> Columns: rowid, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, am, vs

Since there is no term column in the output of the predictions function, we must use parameter identifiers like b1, b2, etc. to determine which estimates we want to compare:

hypotheses(p, hypothesis = "b1 = b2")
#> 
#>   Term Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
#>  b1=b2    -6.93       1.26 -5.49   <0.001  -9.4  -4.46
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Or directly:

predictions(
    mod,
    hypothesis = "b1 = b2",
    newdata = datagrid(am = 0:1, vs = 0:1))
#> 
#>   Term Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
#>  b1=b2    -6.93       1.26 -5.49   <0.001  -9.4  -4.46
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

p$estimate[1] - p$estimate[2]
#> [1] -6.929365

In the next section, we will see that we can get equivalent results by using a vector of contrast weights, which will be used to compute a linear combination of estimates:

predictions(
    mod,
    hypothesis = c(1, -1, 0, 0),
    newdata = datagrid(am = 0:1, vs = 0:1))
#> 
#>    Term Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
#>  custom    -6.93       1.26 -5.49   <0.001  -9.4  -4.46
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

There are many more possibilities:

predictions(
    mod,
    hypothesis = "b1 + b2 = 30",
    newdata = datagrid(am = 0:1, vs = 0:1))
#> 
#>      Term Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>  b1+b2=30     6.12       1.64 3.74   <0.001  2.91   9.32
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

p$estimate[1] + p$estimate[2] - 30
#> [1] 6.118254

predictions(
    mod,
    hypothesis = "(b2 - b1) / (b3 - b2) = 0",
    newdata = datagrid(am = 0:1, vs = 0:1))
#> 
#>               Term Estimate Std. Error      z Pr(>|z|) 2.5 % 97.5 %
#>  (b2-b1)/(b3-b2)=0    -8.03         17 -0.473    0.636 -41.3   25.2
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Average contrasts or marginal effects

The standard workflow with the marginaleffects package is to call a function like predictions(), slopes() or comparisons() to compute unit-level quantities; or one of their cousins avg_predictions(), avg_comparisons(), or avg_slopes() to aggregate the unit-level quantities into “Average Marginal Effects” or “Average Contrasts.” We can also use the comparison argument to emulate the behavior of the avg_*() functions.

First, note that these three commands produce the same results:

comparisons(mod, variables = "vs")$estimate |> mean()
#> [1] 6.929365

avg_comparisons(mod, variables = "vs")
#> 
#>  Term Contrast Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>    vs    1 - 0     6.93       1.26 5.49   <0.001  4.46    9.4
#> 
#> Columns: term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high

comparisons(
    mod,
    variables = "vs",
    comparison = "differenceavg")
#> 
#>  Term          Contrast Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
#>    vs mean(1) - mean(0)     6.93       1.26 5.49   <0.001  4.46    9.4
#> 
#> Columns: term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo

See the transformations section of the Contrasts vignette for more details.

With these results in hand, we can now conduct a linear hypothesis test between average marginal effects:

comparisons(
    mod,
    hypothesis = "am = vs",
    comparison = "differenceavg")
#> 
#>   Term Estimate Std. Error      z Pr(>|z|) 2.5 % 97.5 %
#>  am=vs   -0.863       1.94 -0.445    0.656 -4.66   2.94
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Computing contrasts between average marginal effects requires a little care to obtain the right scale. In particular, we need to specify both the variables and the comparison:

comparisons(
    mod,
    hypothesis = "am = vs",
    variables = c("am", "vs"),
    comparison = "dydxavg")
#> 
#>   Term Estimate Std. Error      z Pr(>|z|) 2.5 % 97.5 %
#>  am=vs   -0.863       1.94 -0.445    0.656 -4.66   2.94
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

hypotheses Vectors and Matrices

The marginal_means() function computes estimated marginal means. The hypothesis argument of that function offers a powerful mechanism to estimate custom contrasts between marginal means, by way of linear combination.

Consider a simple example:

library(marginaleffects)
library(emmeans)
library(nnet)

dat <- mtcars
dat$carb <- factor(dat$carb)
dat$cyl <- factor(dat$cyl)
dat$am <- as.logical(dat$am)

mod <- lm(mpg ~ carb + cyl, dat)
mm <- marginal_means(mod, variables = "carb")
mm
#> 
#>  Term Value Mean Std. Error     z Pr(>|z|) 2.5 % 97.5 %
#>  carb     4 18.9       1.21 15.59   <0.001  16.5   21.3
#>  carb     1 21.7       1.44 15.06   <0.001  18.8   24.5
#>  carb     2 21.3       1.23 17.29   <0.001  18.9   23.8
#>  carb     3 21.4       2.19  9.77   <0.001  17.1   25.7
#>  carb     6 19.8       3.55  5.56   <0.001  12.8   26.7
#>  carb     8 20.1       3.51  5.73   <0.001  13.2   27.0
#> 
#> Results averaged over levels of: cyl, carb 
#> Columns: term, value, carb, estimate, std.error, statistic, p.value, conf.low, conf.high

The contrast between marginal means for carb==1 and carb==2 is:

21.66232 - 21.34058 
#> [1] 0.32174

or

21.66232 + -(21.34058)
#> [1] 0.32174

or

sum(c(21.66232, 21.34058) * c(1, -1))
#> [1] 0.32174

or

c(21.66232, 21.34058) %*% c(1, -1)
#>         [,1]
#> [1,] 0.32174

The last two commands express the contrast of interest as a linear combination of marginal means.

Simple contrast

In the marginal_means() function, we can supply a hypothesis argument to compute linear combinations of marginal means. This argument must be a numeric vector of the same length as the number of rows in the output of marginal_means(). For example, in the previous there were six rows, and the two marginal means we want to compare are at in the first two positions:

lc <- c(1, -1, 0, 0, 0, 0)
marginal_means(mod, variables = "carb", hypothesis = lc)
#> 
#>    Term  Mean Std. Error     z Pr(>|z|) 2.5 % 97.5 %
#>  custom -2.78       2.08 -1.34    0.181 -6.85    1.3
#> 
#> Results averaged over levels of: cyl, carb 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Complex contrast

Of course, we can also estimate more complex contrasts:

lc <- c(-2, 1, 1, 0, -1, 1)
marginal_means(mod, variables = "carb", hypothesis = lc)
#> 
#>    Term Mean Std. Error     z Pr(>|z|) 2.5 % 97.5 %
#>  custom 5.59        6.2 0.901    0.367 -6.57   17.7
#> 
#> Results averaged over levels of: cyl, carb 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

emmeans produces similar results:

library(emmeans)
em <- emmeans(mod, "carb")
lc <- data.frame(custom_contrast = c(-2, 1, 1, 0, -1, 1))
contrast(em, method = lc)
#>  contrast        estimate   SE df t.ratio p.value
#>  custom_contrast   -0.211 6.93 24  -0.030  0.9760
#> 
#> Results are averaged over the levels of: cyl

Multiple contrasts

Users can also compute multiple linear combinations simultaneously by supplying a numeric matrix to hypotheses. This matrix must have the same number of rows as the output of slopes(), and each column represents a distinct set of weights for different linear combinations. The column names of the matrix become labels in the output. For example:

lc <- matrix(c(
    -2, 1, 1, 0, -1, 1,
    1, -1, 0, 0, 0, 0
    ), ncol = 2)
colnames(lc) <- c("Contrast A", "Contrast B")
lc
#>      Contrast A Contrast B
#> [1,]         -2          1
#> [2,]          1         -1
#> [3,]          1          0
#> [4,]          0          0
#> [5,]         -1          0
#> [6,]          1          0

marginal_means(mod, variables = "carb", hypothesis = lc)
#> 
#>        Term  Mean Std. Error      z Pr(>|z|) 2.5 % 97.5 %
#>  Contrast A  5.59       6.20  0.901    0.367 -6.57   17.7
#>  Contrast B -2.78       2.08 -1.337    0.181 -6.85    1.3
#> 
#> Results averaged over levels of: cyl, carb 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Contrasts across response levels

In models with multinomial outcomes, one may be interested in comparing outcomes or contrasts across response levels. For example, in this model there are 18 estimated marginal means, across 6 outcome levels (the group column):

library(nnet)
mod <- multinom(carb ~ mpg + cyl, data = dat, trace = FALSE)
mm <- marginal_means(mod, type = "probs")
mm
#> 
#>  Group Term Value     Mean Std. Error       z Pr(>|z|)     2.5 %   97.5 %
#>      1  cyl     6 2.83e-01   1.96e-01 1.44491   0.1485 -1.01e-01 6.67e-01
#>      1  cyl     4 3.68e-01   2.60e-01 1.41643   0.1566 -1.41e-01 8.77e-01
#>      1  cyl     8 4.63e-04   9.59e-03 0.04824   0.9615 -1.83e-02 1.93e-02
#>      2  cyl     6 1.85e-06   2.40e-06 0.77087   0.4408 -2.85e-06 6.55e-06
#>      2  cyl     4 6.31e-01   2.60e-01 2.42765   0.0152  1.22e-01 1.14e+00
#>      2  cyl     8 6.65e-01   3.74e-01 1.77961   0.0751 -6.74e-02 1.40e+00
#>      3  cyl     6 1.12e-05   1.32e-03 0.00848   0.9932 -2.58e-03 2.60e-03
#>      3  cyl     4 6.85e-04   1.19e-02 0.05748   0.9542 -2.27e-02 2.40e-02
#>      3  cyl     8 3.10e-01   3.71e-01 0.83676   0.4027 -4.17e-01 1.04e+00
#>      4  cyl     6 5.56e-01   2.18e-01 2.55000   0.0108  1.29e-01 9.84e-01
#>      4  cyl     4 2.12e-04   1.75e-02 0.01211   0.9903 -3.41e-02 3.45e-02
#>      4  cyl     8 9.58e-03   2.28e-02 0.41999   0.6745 -3.51e-02 5.43e-02
#>      6  cyl     6 1.61e-01   1.54e-01 1.04685   0.2952 -1.40e-01 4.62e-01
#>      6  cyl     4 8.82e-06   8.39e-05 0.10505   0.9163 -1.56e-04 1.73e-04
#>      6  cyl     8 4.35e-09   9.89e-08 0.04393   0.9650 -1.90e-07 1.98e-07
#>      8  cyl     6 9.29e-06   7.98e-04 0.01164   0.9907 -1.56e-03 1.57e-03
#>      8  cyl     4 1.50e-04   7.97e-03 0.01878   0.9850 -1.55e-02 1.58e-02
#>      8  cyl     8 1.41e-02   4.66e-02 0.30317   0.7618 -7.72e-02 1.05e-01
#> 
#> Results averaged over levels of: mpg, cyl 
#> Columns: group, term, value, cyl, estimate, std.error, statistic, p.value, conf.low, conf.high

Let’s contrast the marginal means in the first outcome level when cyl equals 4 and 6. These marginal means are located in rows 1 and 7 respectively:

lc <- rep(0, nrow(mm))
lc[1] <- -1
lc[7] <- 1
marginal_means(
    mod,
    type = "probs",
    hypothesis = lc)
#> 
#>    Term   Mean Std. Error     z Pr(>|z|)  2.5 % 97.5 %
#>  custom -0.283      0.196 -1.44    0.149 -0.667  0.101
#> 
#> Results averaged over levels of: mpg, cyl 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

This is indeed equal to the results we would have obtained manually:

2.828726e-01 - 3.678521e-01
#> [1] -0.0849795

Now let’s say we want to calculate a “contrast in contrasts”, that is, the outcome of a 3-step process:

  1. Contrast between cyl=6 and cyl=4 in the 1st outcome level
  2. Contrast between cyl=6 and cyl=4 in the 2nd outcome level
  3. Contrast between the contrasts defined in steps 1 and 2.

We create the linear combination weights as follows:

lc <- rep(0, nrow(mm))
lc[c(1, 8)] <- -1
lc[c(7, 2)] <- 1

To make sure that the weights are correct, we can display them side by side with the original marginal_means() output:

transform(mm[, 1:3], lc = lc)
#>    group term value lc
#> 1      1  cyl     6 -1
#> 2      1  cyl     4  1
#> 3      1  cyl     8  0
#> 4      2  cyl     6  0
#> 5      2  cyl     4  0
#> 6      2  cyl     8  0
#> 7      3  cyl     6  1
#> 8      3  cyl     4 -1
#> 9      3  cyl     8  0
#> 10     4  cyl     6  0
#> 11     4  cyl     4  0
#> 12     4  cyl     8  0
#> 13     6  cyl     6  0
#> 14     6  cyl     4  0
#> 15     6  cyl     8  0
#> 16     8  cyl     6  0
#> 17     8  cyl     4  0
#> 18     8  cyl     8  0

Compute the results:

marginal_means(mod, type = "probs", hypothesis = lc)
#> 
#>    Term   Mean Std. Error     z Pr(>|z|)  2.5 % 97.5 %
#>  custom 0.0843      0.321 0.263    0.793 -0.545  0.714
#> 
#> Results averaged over levels of: mpg, cyl 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Pairwise contrasts: Difference-in-Differences

Now we illustrate how to use the machinery described above to do pairwise comparisons between contrasts, a type of analysis often associated with a “Difference-in-Differences” research design.

First, we simulate data with two treatment groups and pre/post periods:

library(data.table)

N <- 1000
did <- data.table(
    id = 1:N,
    pre = rnorm(N),
    trt = sample(0:1, N, replace = TRUE))
did$post <- did$pre + did$trt * 0.3 + rnorm(N)
did <- melt(
    did,
    value.name = "y",
    variable.name = "time",
    id.vars = c("id", "trt"))
head(did)
#>    id trt time          y
#> 1:  1   1  pre -0.3684420
#> 2:  2   0  pre  0.7661930
#> 3:  3   0  pre -0.3778234
#> 4:  4   0  pre  1.5173513
#> 5:  5   1  pre -0.4112270
#> 6:  6   1  pre -1.2143283

Then, we estimate a linear model with a multiple interaction between the time and the treatment indicators. We also compute contrasts at the mean for each treatment level:

did_model <- lm(y ~ time * trt, data = did)

comparisons(
    did_model,
    newdata = datagrid(trt = 0:1),
    variables = "time")
#> 
#>  Term   Contrast Estimate Std. Error      z Pr(>|z|)  2.5 % 97.5 % trt
#>  time post - pre  -0.0533     0.0779 -0.684    0.494 -0.206 0.0994   0
#>  time post - pre   0.2760     0.0761  3.628   <0.001  0.127 0.4250   1
#> 
#> Columns: rowid, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, y, time, trt

Finally, we compute pairwise differences between contrasts. This is the Diff-in-Diff estimate:

comparisons(
    did_model,
    variables = "time",
    newdata = datagrid(trt = 0:1),
    hypothesis = "pairwise")
#> 
#>           Term Estimate Std. Error     z Pr(>|z|)  2.5 % 97.5 %
#>  Row 1 - Row 2   -0.329      0.109 -3.02  0.00249 -0.543 -0.116
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

Joint hyotheses tests

The hypotheses() function can also test multiple hypotheses jointly. For example, consider this model:

model <- lm(mpg ~ as.factor(cyl) * hp, data = mtcars)
coef(model)
#>        (Intercept)    as.factor(cyl)6    as.factor(cyl)8                 hp as.factor(cyl)6:hp as.factor(cyl)8:hp 
#>        35.98302564       -15.30917451       -17.90295193        -0.11277589         0.10516262         0.09853177

We may want to test the null hypothesis that two of the coefficients are jointly (both) equal to zero.

hypotheses(model, joint = c("as.factor(cyl)6:hp", "as.factor(cyl)8:hp"))
#> 
#> 
#> Joint hypothesis test:
#> as.factor(cyl)6:hp = 0
#> as.factor(cyl)8:hp = 0
#>  
#>     F Pr(>|F|) Df 1 Df 2
#>  2.11    0.142    2   26
#> 
#> Columns: statistic, p.value, df1, df2

The joint argument allows users to flexibly specify the parameters to be tested, using character vectors, integer indices, or Perl-compatible regular expressions. We can also specificy the null hypothesis for each parameter individually using the hypothesis argument.

Naturally, the hypotheses function also works with marginaleffects objects.

# joint hypotheses: regular expression
hypotheses(model, joint = "cyl")
#> 
#> 
#> Joint hypothesis test:
#>  as.factor(cyl)6 = 0
#>  as.factor(cyl)8 = 0
#>  as.factor(cyl)6:hp = 0
#>  as.factor(cyl)8:hp = 0
#>  
#>    F Pr(>|F|) Df 1 Df 2
#>  5.7  0.00197    4   26
#> 
#> Columns: statistic, p.value, df1, df2

# joint hypotheses: integer indices
hypotheses(model, joint = 2:3)
#> 
#> 
#> Joint hypothesis test:
#>  as.factor(cyl)6 = 0
#>  as.factor(cyl)8 = 0
#>  
#>     F Pr(>|F|) Df 1 Df 2
#>  6.12  0.00665    2   26
#> 
#> Columns: statistic, p.value, df1, df2

# joint hypotheses: different null hypotheses
hypotheses(model, joint = 2:3, hypothesis = 1)
#> 
#> 
#> Joint hypothesis test:
#>  as.factor(cyl)6 = 1
#>  as.factor(cyl)8 = 1
#>  
#>     F Pr(>|F|) Df 1 Df 2
#>  6.84  0.00411    2   26
#> 
#> Columns: statistic, p.value, df1, df2
hypotheses(model, joint = 2:3, hypothesis = 1:2)
#> 
#> 
#> Joint hypothesis test:
#>  as.factor(cyl)6 = 1
#>  as.factor(cyl)8 = 2
#>  
#>     F Pr(>|F|) Df 1 Df 2
#>  7.47  0.00273    2   26
#> 
#> Columns: statistic, p.value, df1, df2

# joint hypotheses: marginaleffects object
cmp <- avg_comparisons(model)
hypotheses(cmp, joint = "cyl")
#> 
#> 
#> Joint hypothesis test:
#>  cyl 6 - 4 = 0
#>  cyl 8 - 4 = 0
#>  
#>    F Pr(>|F|) Df 1 Df 2
#>  1.6    0.219    2   29
#> 
#> Columns: statistic, p.value, df1, df2

We can also combine multiple calls to hypotheses to execute a joint test on linear combinations of coefficients:

# fit model
mod <- lm(mpg ~ factor(carb), mtcars)

# hypothesis matrix for linear combinations
H <- matrix(0, nrow = length(coef(mod)), ncol = 2)
H[2:3, 1] <- H[4:6, 2] <- 1

# test individual linear combinations
hyp <- hypotheses(mod, hypothesis = H)
hyp
#> 
#>    Term Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
#>  custom    -12.0       4.92 -2.44  0.01477 -21.6  -2.35
#>  custom    -25.5       9.03 -2.83  0.00466 -43.2  -7.85
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

# test joint hypotheses
hypotheses(hyp, joint = TRUE, hypothesis = c(-10, -20))
#> 
#> 
#> Joint hypothesis test:
#>  b1 = -10
#>  b2 = -20
#>  
#>      F Pr(>|F|) Df 1 Df 2
#>  0.197    0.822    2   30
#> 
#> Columns: statistic, p.value, df1, df2

Equivalence, non-inferiority, and non-superiority tests

In many contexts, analysts are less interested in rejecting a null hypothesis, and more interested in testing whether an estimate is “inferior”, “superior”, or “equivalent” to a given threshold or interval. For example, medical researchers may wish to determine if the estimated effect of a new treatment is larger than the effect of prior treatments, or larger than some threshold of “clinical significance.” Alternatively, researchers may wish to support a claim that an estimated parameter is “equivalent to” or “not meaningfully different from” a null hypothesis.

To answer these questions, we can use non-inferiority, non-superiority, or equivalence tests like the two-one-sided test (TOST). This article gives a primer and tutorial on TOST:

Lakens D, Scheel AM, Isager PM. Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science. 2018;1(2):259-269. doi:10.1177/2515245918770963

The hypotheses() function of the marginaleffects package includes an equivalence argument which allows users to apply these tests to any of the quantities generated by the package, as well as to arbitrary functions of a model’s parameters. To illustrate, we begin by estimating a simple linear regression model:

mod <- lm(mpg ~ hp + factor(gear), data = mtcars)

The rest of this section considers several quantities estimated by marginaleffects.

Example 1: Predictions

Consider a single prediction, where all predictors are held at their median or mode:

p <- predictions(mod, newdata = "median")
p
#> 
#>  Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %  hp gear
#>      19.7          1 19.6   <0.001  17.7   21.6 123    3
#> 
#> Columns: rowid, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, hp, gear

Now we specify an equivalence interval (or “region”) for predictions between 17 and 18:

hypotheses(p, equivalence = c(17, 18))
#> 
#>  Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %  hp gear p (NonInf) p (NonSup) p (Equiv)
#>      19.7          1 19.6   <0.001  17.7   21.6 123    3    0.00404      0.951     0.951
#> 
#> Columns: rowid, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, hp, gear, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

The results allow us to draw three conclusions:

  1. The p value for the non-inferiority test is 0.0040. This suggests that we can reject the null hypothesis that the parameter is below 17.
  2. The p value for the non-superiority test is 0.9508. This suggests that we cannot reject the null hypothesis that the parameter (19.6589) is above 18.
  3. The p value for the equivalence test is 0.9508. This suggests that we cannot reject the hypothesis that the parameter falls outside the equivalence interval.
Example 2: Model coefficients

The hypotheses function also allows users to conduct equivalence, non-inferiority, and non-superiority tests for model coefficients, and for arbitrary functions of model coefficients.

Our estimate of the 4th coefficient in the model is:

coef(mod)[4]
#> factor(gear)5 
#>      6.574763

We can test if this parameter is likely to fall in the [5,7] interval by:

hypotheses(mod, equivalence = c(5, 7))[4, ]
#> 
#>  Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % p (NonInf) p (NonSup) p (Equiv)
#>    b4     6.57       1.64 4   <0.001  3.36   9.79      0.169      0.398     0.398
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

The p value is 0.3979, so we cannot reject the hypothesis that the factor(gear)5 parameter falls outside the [5,7] interval.

Example 3: Slopes

The same syntax can be used to conduct tests for all the quantities produced by the marginaleffects package. For example, imagine that, for substantive or theoretical reasons, an average slope between -0.1 and 0.1 is uninteresting. We can conduct an equivalence test to check if this is the case:

avg_slopes(mod, variables = "hp", equivalence = c(-.1, .1))
#> 
#>  Term Estimate Std. Error     z Pr(>|z|)   2.5 %  97.5 % p (NonInf) p (NonSup) p (Equiv)
#>    hp  -0.0669      0.011 -6.05   <0.001 -0.0885 -0.0452    0.00135     <0.001   0.00135
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

The p value is 0.0013, which suggests that we can reject the hypothesis that the parameter falls outside the region of “substantive equivalence” that we have defined by the interval.

Example 4: Difference between comparisons (contrasts)

Consider a model with a multiplicative interaction:

int <- lm(mpg ~ hp * factor(gear), data = mtcars)

The average contrast for a change of 1 unit in hp differs based on the value of gear:

avg_comparisons(int, variables = "hp", by = "gear")
#> 
#>  Term Contrast gear Estimate Std. Error     z Pr(>|z|)   2.5 %  97.5 %
#>    hp mean(+1)    4  -0.1792     0.0303 -5.92   <0.001 -0.2385 -0.1199
#>    hp mean(+1)    3  -0.0522     0.0146 -3.59   <0.001 -0.0808 -0.0237
#>    hp mean(+1)    5  -0.0583     0.0126 -4.61   <0.001 -0.0830 -0.0335
#> 
#> Columns: term, contrast, gear, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo

Are these contrasts different from one another? Let’s look at the pairwise differences between them:

avg_comparisons(int, variables = "hp", by = "gear",
    hypothesis = "pairwise")
#> 
#>   Term Estimate Std. Error      z Pr(>|z|)   2.5 %  97.5 %
#>  4 - 3 -0.12695     0.0336 -3.781   <0.001 -0.1928 -0.0611
#>  4 - 5 -0.12092     0.0328 -3.688   <0.001 -0.1852 -0.0567
#>  3 - 5  0.00603     0.0193  0.313    0.754 -0.0318  0.0438
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high

We consider that these pairwise comparisons are “equivalent to zero” when they fall in the [-.1, .1] interval:

avg_comparisons(int, variables = "hp", by = "gear",
    hypothesis = "pairwise",
    equivalence = c(-.1, .1))
#> 
#>   Term Estimate Std. Error      z Pr(>|z|)   2.5 %  97.5 % p (NonInf) p (NonSup) p (Equiv)
#>  4 - 3 -0.12695     0.0336 -3.781   <0.001 -0.1928 -0.0611      0.789     <0.001     0.789
#>  4 - 5 -0.12092     0.0328 -3.688   <0.001 -0.1852 -0.0567      0.738     <0.001     0.738
#>  3 - 5  0.00603     0.0193  0.313    0.754 -0.0318  0.0438     <0.001     <0.001    <0.001
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

The p (Equiv) column shows that the difference between the average contrasts when gear is 3 and gear is 5 can be said to be equivalent to the specified interval. However, there are good reasons to think that the other two pairwise comparisons may fall outside the interval.

Example 5: Marginal means and emmeans

This example shows the equivalence between results produced by the emmeans package and the marginal_means() function:

library(emmeans)

mod <- lm(log(conc) ~ source + factor(percent), data = pigs)

# {emmeans}
emmeans(mod, specs = "source") |>
    pairs() |>
    test(df = Inf,
         null = 0,
         delta = log(1.25),
         side = "equivalence",
         adjust = "none")
#>  contrast    estimate     SE  df z.ratio p.value
#>  fish - soy    -0.273 0.0529 Inf   0.937  0.8257
#>  fish - skim   -0.402 0.0542 Inf   3.308  0.9995
#>  soy - skim    -0.130 0.0530 Inf  -1.765  0.0388
#> 
#> Results are averaged over the levels of: percent 
#> Degrees-of-freedom method: user-specified 
#> Results are given on the log (not the response) scale. 
#> Statistics are tests of equivalence with a threshold of 0.22314 
#> P values are left-tailed

# {marginaleffects}
marginal_means(
    mod,
    variables = "source",
    hypothesis = "pairwise",
    equivalence = c(-log(1.25), log(1.25)))
#> 
#>         Term   Mean Std. Error     z Pr(>|z|)  2.5 %  97.5 % p (Equiv) p (NonInf) p (NonSup)
#>  fish - soy  -0.273     0.0529 -5.15   <0.001 -0.377 -0.1690    0.8257     0.8257     <0.001
#>  fish - skim -0.402     0.0542 -7.43   <0.001 -0.508 -0.2961    0.9995     0.9995     <0.001
#>  soy - skim  -0.130     0.0530 -2.44   0.0146 -0.233 -0.0255    0.0388     0.0388     <0.001
#> 
#> Results averaged over levels of: percent, source 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, p.value.equiv, p.value.noninf, p.value.nonsup, statistic.noninf, statistic.nonsup
Example 6: t-test

Now we show that the results produced by hypotheses() are identical to the results produced by the equivalence package in the case of a simple t-test:

library(equivalence)

set.seed(1024)

# simulate data data
N <- 20
dat <- data.frame(
    y = rnorm(N),
    x = sample(c(rep(0, N / 2), rep(1, N / 2)), N))

# fit model
mod <- lm(y ~ x, data = dat)

# test with the {equivalence} package
e <- tost(
    x = dat$y[dat$x == 0],
    y = dat$y[dat$x == 1],
    epsilon = 10)
e
#> 
#>  Welch Two Sample TOST
#> 
#> data:  dat$y[dat$x == 0] and dat$y[dat$x == 1]
#> df = 17.607
#> sample estimates:
#>  mean of x  mean of y 
#> -0.3788551 -0.2724594 
#> 
#> Epsilon: 10 
#> 95 percent two one-sided confidence interval (TOST interval):
#>  -1.058539  0.845747
#> Null hypothesis of statistical difference is: rejected 
#> TOST p-value: 4.248528e-13

# test with {marginaleffects} package
hypotheses(mod, equivalence = c(-10, 10), df = e$parameter)[2, ]
#> 
#>  Term Estimate Std. Error     t Pr(>|t|) 2.5 % 97.5 %   Df p (NonInf) p (NonSup) p (Equiv)
#>    b2    0.106      0.548 0.194    0.848 -1.05   1.26 17.6     <0.001     <0.001    <0.001
#> 
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, df, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv