Hypothesis Tests, Equivalence Tests, and Custom Contrasts
Source:vignettes/hypothesis.Rmd
hypothesis.Rmd
The code in this vignette requires marginaleffects
version 0.6.0 or the
development version hosted on Github.
This vignette introduces the hypotheses()
function, and
the hypothesis
argument of the comparisons()
,
slopes()
, and predictions()
function. These
features allow users to conduct linear and non-linear hypothesis tests
and to compute custom contrasts (linear combinations) between
parameters.
Null hypothesis
The simplest way to modify a hypothesis test is to change the null
hypothesis. By default, all functions in the
marginaleffects
package assume that the null is 0. This can
be changed by changing the hypothesis
argument.
For example, consider a logistic regression model:
library(marginaleffects)
mod <- glm(am ~ hp + drat, data = mtcars, family = binomial)
We can compute the predicted outcome for a hypothetical unit where all regressors are fixed to their sample means:
predictions(mod, newdata = "mean")
#>
#> Estimate Pr(>|z|) 2.5 % 97.5 % hp drat
#> 0.231 0.135 0.0584 0.592 147 3.6
#>
#> Columns: rowid, estimate, p.value, conf.low, conf.high, am, hp, drat
The Z statistic and p value reported above assume that the null
hypothesis equals zero. We can change the null with the
hypothesis
argument:
predictions(mod, newdata = "mean", hypothesis = .5)
#>
#> Estimate Pr(>|z|) 2.5 % 97.5 % hp drat
#> 0.231 0.0343 0.0584 0.592 147 3.6
#>
#> Columns: rowid, estimate, p.value, conf.low, conf.high, am, hp, drat
This can obviously be useful in other contexts. For instance, if we
compute risk ratios (at the mean) associated with an increase of 1 unit
in hp
, it makes more sense to test the null hypothesis that
the ratio of predictions is 1 rather than 0:
comparisons(
mod,
newdata = "mean",
variables = "hp",
comparison = "ratio",
hypothesis = 1) |>
print(digits = 3)
#>
#> Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> hp +1 1.01 0.00793 1.05 0.293 0.993 1.02
#>
#> Columns: rowid, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, am, hp, drat
Warning: Z statistics and p values are computed before
applying functions in transform
.
Hypothesis tests with the delta method
Version 0.6.0 of marginaleffects
includes powerful
function called hypotheses()
. This function emulates the
behavior of the well-established car::deltaMethod
and
car::linearHypothesis
functions, but it supports more
models, requires fewer dependencies, and offers some convenience
features like shortcuts for robust standard errors.
hypotheses()
can be used to compute estimates and
standard errors of arbitrary functions of model parameters. For example,
it can be used to conduct tests of equality between coefficients, or to
test the value of some linear or non-linear combination of quantities of
interest. hypotheses()
can also be used to conduct
hypothesis tests on other functions of a model’s parameter, such as
adjusted predictions or marginal effects.
Let’s start by estimating a simple model:
library(marginaleffects)
mod <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)
When the FUN
and hypothesis
arguments of
hypotheses()
equal NULL
(the default), the
function returns a data.frame of raw estimates:
hypotheses(mod)
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> b1 35.8460 2.041 17.56 <0.001 31.8457 39.846319
#> b2 -0.0231 0.012 -1.93 0.0531 -0.0465 0.000306
#> b3 -3.1814 0.720 -4.42 <0.001 -4.5918 -1.771012
#> b4 -3.3590 1.402 -2.40 0.0166 -6.1062 -0.611803
#> b5 -3.1859 2.170 -1.47 0.1422 -7.4399 1.068169
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Test of equality between coefficients:
hypotheses(mod, "hp = wt")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> hp = wt 3.16 0.72 4.39 <0.001 1.75 4.57
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Non-linear function of coefficients
hypotheses(mod, "exp(hp + wt) = 0.1")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> exp(hp + wt) = 0.1 -0.0594 0.0292 -2.04 0.0418 -0.117 -0.0022
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
The vcov
argument behaves in the same was as in the
slopes()
function. It allows us to easily compute robust
standard errors:
hypotheses(mod, "hp = wt", vcov = "HC3")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> hp = wt 3.16 0.805 3.92 <0.001 1.58 4.74
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
We can use shortcuts like b1
, b2
,
...
to identify the position of each parameter in the
output of FUN
. For example, b2=b3
is
equivalent to hp=wt
because those term names appear in the
2nd and 3rd row when we call hypotheses(mod)
.
hypotheses(mod, "b2 = b3")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> b2 = b3 3.16 0.72 4.39 <0.001 1.75 4.57
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Term names with special characters must be enclosed in backticks:
hypotheses(mod, "`factor(cyl)6` = `factor(cyl)8`")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> `factor(cyl)6` = `factor(cyl)8` -0.173 1.65 -0.105 0.917 -3.41 3.07
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
The FUN
argument can be used to compute standard errors
for arbitrary functions of model parameters. This user-supplied function
must accept a single model object, and return a numeric vector or a
data.frame with two columns named term
and
estimate
.
mod <- glm(am ~ hp + mpg, data = mtcars, family = binomial)
f <- function(x) predict(x, type = "link", newdata = mtcars)
p <- hypotheses(mod, FUN = f)
head(p)
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> b1 -1.098 0.716 -1.534 0.125 -2.50 0.305
#> b2 -1.098 0.716 -1.534 0.125 -2.50 0.305
#> b3 0.233 0.781 0.299 0.765 -1.30 1.764
#> b4 -0.595 0.647 -0.919 0.358 -1.86 0.674
#> b5 -0.418 0.647 -0.645 0.519 -1.69 0.851
#> b6 -5.026 2.195 -2.290 0.022 -9.33 -0.725
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Test of equality between two predictions (row 2 vs row 3):
f <- function(x) predict(x, newdata = mtcars)
hypotheses(mod, FUN = f, hypothesis = "b2 = b3")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> b2 = b3 -1.33 0.616 -2.16 0.0305 -2.54 -0.125
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Note that we specified the newdata
argument in the
f
function. This is because the predict()
method associated with lm
objects will automatically the
original fitted values when newdata
is NULL
,
instead of returning the slightly altered fitted values which we need to
compute numerical derivatives in the delta method.
We can also use numeric vectors to specify linear combinations of parameters. For example, there are 3 coefficients in the last model we estimated. To test the null hypothesis that the sum of the 2nd and 3rd coefficients is equal to 0, we can do:
hypotheses(mod, hypothesis = c(0, 1, 1))
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> custom 1.31 0.593 2.22 0.0266 0.153 2.48
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
See below for more example of how to use string formulas, numeric vectors, or matrices to calculate custom contrasts, linear combinations, and linear or non-linear hypothesis tests.
hypotheses
Formulas
Each of the 4 core functions of the package support a
hypothesis
argument which behaves similarly to the
hypotheses()
function. This argument allows users to
specify custom hypothesis tests and contrasts, in order to test null
hypotheses such as:
- The coefficients \(\beta_1\) and \(\beta_2\) are equal.
- The marginal effects of \(X_1\) and \(X_2\) equal.
- The marginal effect of \(X\) when \(W=0\) is equal to the marginal effect of \(X\) when \(W=1\).
- A non-linear function of adjusted predictions is equal to 100.
- The marginal mean in the control group is equal to the average of marginal means in the other 3 treatment arms.
- Cross-level contrasts: In a multinomial model, the effect of \(X\) on the 1st outcome level is equal to the effect of \(X\) on the 2nd outcome level.
Marginal effects
For example, let’s fit a model and compute some marginal effects at the mean:
library(marginaleffects)
mod <- lm(mpg ~ am + vs, data = mtcars)
mfx <- slopes(mod, newdata = "mean")
mfx
#>
#> Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> am 1 - 0 6.07 1.27 4.76 <0.001 3.57 8.57
#> vs 1 - 0 6.93 1.26 5.49 <0.001 4.46 9.40
#>
#> Columns: rowid, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, mpg, am, vs
Is the marginal effect of am
different from the marginal
effect of vs
? To answer this question we can run a linear
hypothesis test using the hypotheses
function:
hypotheses(mfx, hypothesis = "am = vs")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> am=vs -0.863 1.94 -0.445 0.656 -4.66 2.94
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Alternatively, we can specify the hypothesis directly in the original call:
library(marginaleffects)
mod <- lm(mpg ~ am + vs, data = mtcars)
slopes(
mod,
newdata = "mean",
hypothesis = "am = vs")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> am=vs -0.863 1.94 -0.445 0.656 -4.66 2.94
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
The hypotheses
string can include any valid
R
expression, so we can run some silly non-linear
tests:
slopes(
mod,
newdata = "mean",
hypothesis = "exp(am) - 2 * vs = -400")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> exp(am)-2*vs=-400 817 550 1.49 0.137 -261 1896
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Adjusted Predictions
Now consider the case of adjusted predictions:
p <- predictions(
mod,
newdata = datagrid(am = 0:1, vs = 0:1))
p
#>
#> Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % am vs
#> 14.6 0.926 15.8 <0.001 12.8 16.4 0 0
#> 21.5 1.130 19.0 <0.001 19.3 23.7 0 1
#> 20.7 1.183 17.5 <0.001 18.3 23.0 1 0
#> 27.6 1.130 24.4 <0.001 25.4 29.8 1 1
#>
#> Columns: rowid, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, am, vs
Since there is no term
column in the output of the
predictions
function, we must use parameter identifiers
like b1
, b2
, etc. to determine which estimates
we want to compare:
hypotheses(p, hypothesis = "b1 = b2")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> b1=b2 -6.93 1.26 -5.49 <0.001 -9.4 -4.46
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Or directly:
predictions(
mod,
hypothesis = "b1 = b2",
newdata = datagrid(am = 0:1, vs = 0:1))
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> b1=b2 -6.93 1.26 -5.49 <0.001 -9.4 -4.46
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
p$estimate[1] - p$estimate[2]
#> [1] -6.929365
In the next section, we will see that we can get equivalent results by using a vector of contrast weights, which will be used to compute a linear combination of estimates:
predictions(
mod,
hypothesis = c(1, -1, 0, 0),
newdata = datagrid(am = 0:1, vs = 0:1))
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> custom -6.93 1.26 -5.49 <0.001 -9.4 -4.46
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
There are many more possibilities:
predictions(
mod,
hypothesis = "b1 + b2 = 30",
newdata = datagrid(am = 0:1, vs = 0:1))
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> b1+b2=30 6.12 1.64 3.74 <0.001 2.91 9.32
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
p$estimate[1] + p$estimate[2] - 30
#> [1] 6.118254
predictions(
mod,
hypothesis = "(b2 - b1) / (b3 - b2) = 0",
newdata = datagrid(am = 0:1, vs = 0:1))
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> (b2-b1)/(b3-b2)=0 -8.03 17 -0.473 0.636 -41.3 25.2
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Average contrasts or marginal effects
The standard workflow with the marginaleffects
package
is to call a function like predictions()
,
slopes()
or comparisons()
to compute
unit-level quantities; or one of their cousins
avg_predictions()
, avg_comparisons()
, or
avg_slopes()
to aggregate the unit-level quantities into
“Average Marginal Effects” or “Average Contrasts.” We can also use the
comparison
argument to emulate the behavior of the
avg_*()
functions.
First, note that these three commands produce the same results:
comparisons(mod, variables = "vs")$estimate |> mean()
#> [1] 6.929365
avg_comparisons(mod, variables = "vs")
#>
#> Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> vs 1 - 0 6.93 1.26 5.49 <0.001 4.46 9.4
#>
#> Columns: term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high
comparisons(
mod,
variables = "vs",
comparison = "differenceavg")
#>
#> Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> vs mean(1) - mean(0) 6.93 1.26 5.49 <0.001 4.46 9.4
#>
#> Columns: term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo
See the transformations section of the Contrasts vignette for more details.
With these results in hand, we can now conduct a linear hypothesis test between average marginal effects:
comparisons(
mod,
hypothesis = "am = vs",
comparison = "differenceavg")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> am=vs -0.863 1.94 -0.445 0.656 -4.66 2.94
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Computing contrasts between average marginal effects requires a
little care to obtain the right scale. In particular, we need to specify
both the variables
and the comparison
:
comparisons(
mod,
hypothesis = "am = vs",
variables = c("am", "vs"),
comparison = "dydxavg")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> am=vs -0.863 1.94 -0.445 0.656 -4.66 2.94
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
hypotheses
Vectors and Matrices
The marginal_means()
function computes estimated
marginal means. The hypothesis
argument of that
function offers a powerful mechanism to estimate custom contrasts
between marginal means, by way of linear combination.
Consider a simple example:
library(marginaleffects)
library(emmeans)
library(nnet)
dat <- mtcars
dat$carb <- factor(dat$carb)
dat$cyl <- factor(dat$cyl)
dat$am <- as.logical(dat$am)
mod <- lm(mpg ~ carb + cyl, dat)
mm <- marginal_means(mod, variables = "carb")
mm
#>
#> Term Value Mean Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> carb 4 18.9 1.21 15.59 <0.001 16.5 21.3
#> carb 1 21.7 1.44 15.06 <0.001 18.8 24.5
#> carb 2 21.3 1.23 17.29 <0.001 18.9 23.8
#> carb 3 21.4 2.19 9.77 <0.001 17.1 25.7
#> carb 6 19.8 3.55 5.56 <0.001 12.8 26.7
#> carb 8 20.1 3.51 5.73 <0.001 13.2 27.0
#>
#> Results averaged over levels of: cyl, carb
#> Columns: term, value, carb, estimate, std.error, statistic, p.value, conf.low, conf.high
The contrast between marginal means for carb==1
and
carb==2
is:
21.66232 - 21.34058
#> [1] 0.32174
or
21.66232 + -(21.34058)
#> [1] 0.32174
or
or
The last two commands express the contrast of interest as a linear combination of marginal means.
Simple contrast
In the marginal_means()
function, we can supply a
hypothesis
argument to compute linear combinations of
marginal means. This argument must be a numeric vector of the same
length as the number of rows in the output of
marginal_means()
. For example, in the previous there were
six rows, and the two marginal means we want to compare are at in the
first two positions:
lc <- c(1, -1, 0, 0, 0, 0)
marginal_means(mod, variables = "carb", hypothesis = lc)
#>
#> Term Mean Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> custom -2.78 2.08 -1.34 0.181 -6.85 1.3
#>
#> Results averaged over levels of: cyl, carb
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Complex contrast
Of course, we can also estimate more complex contrasts:
lc <- c(-2, 1, 1, 0, -1, 1)
marginal_means(mod, variables = "carb", hypothesis = lc)
#>
#> Term Mean Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> custom 5.59 6.2 0.901 0.367 -6.57 17.7
#>
#> Results averaged over levels of: cyl, carb
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
emmeans
produces similar results:
Multiple contrasts
Users can also compute multiple linear combinations simultaneously by
supplying a numeric matrix to hypotheses
. This matrix must
have the same number of rows as the output of slopes()
, and
each column represents a distinct set of weights for different linear
combinations. The column names of the matrix become labels in the
output. For example:
lc <- matrix(c(
-2, 1, 1, 0, -1, 1,
1, -1, 0, 0, 0, 0
), ncol = 2)
colnames(lc) <- c("Contrast A", "Contrast B")
lc
#> Contrast A Contrast B
#> [1,] -2 1
#> [2,] 1 -1
#> [3,] 1 0
#> [4,] 0 0
#> [5,] -1 0
#> [6,] 1 0
marginal_means(mod, variables = "carb", hypothesis = lc)
#>
#> Term Mean Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> Contrast A 5.59 6.20 0.901 0.367 -6.57 17.7
#> Contrast B -2.78 2.08 -1.337 0.181 -6.85 1.3
#>
#> Results averaged over levels of: cyl, carb
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Contrasts across response levels
In models with multinomial outcomes, one may be interested in
comparing outcomes or contrasts across response levels. For example, in
this model there are 18 estimated marginal means, across 6 outcome
levels (the group
column):
library(nnet)
mod <- multinom(carb ~ mpg + cyl, data = dat, trace = FALSE)
mm <- marginal_means(mod, type = "probs")
mm
#>
#> Group Term Value Mean Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> 1 cyl 6 2.83e-01 1.96e-01 1.44491 0.1485 -1.01e-01 6.67e-01
#> 1 cyl 4 3.68e-01 2.60e-01 1.41643 0.1566 -1.41e-01 8.77e-01
#> 1 cyl 8 4.63e-04 9.59e-03 0.04824 0.9615 -1.83e-02 1.93e-02
#> 2 cyl 6 1.85e-06 2.40e-06 0.77087 0.4408 -2.85e-06 6.55e-06
#> 2 cyl 4 6.31e-01 2.60e-01 2.42765 0.0152 1.22e-01 1.14e+00
#> 2 cyl 8 6.65e-01 3.74e-01 1.77961 0.0751 -6.74e-02 1.40e+00
#> 3 cyl 6 1.12e-05 1.32e-03 0.00848 0.9932 -2.58e-03 2.60e-03
#> 3 cyl 4 6.85e-04 1.19e-02 0.05748 0.9542 -2.27e-02 2.40e-02
#> 3 cyl 8 3.10e-01 3.71e-01 0.83676 0.4027 -4.17e-01 1.04e+00
#> 4 cyl 6 5.56e-01 2.18e-01 2.55000 0.0108 1.29e-01 9.84e-01
#> 4 cyl 4 2.12e-04 1.75e-02 0.01211 0.9903 -3.41e-02 3.45e-02
#> 4 cyl 8 9.58e-03 2.28e-02 0.41999 0.6745 -3.51e-02 5.43e-02
#> 6 cyl 6 1.61e-01 1.54e-01 1.04685 0.2952 -1.40e-01 4.62e-01
#> 6 cyl 4 8.82e-06 8.39e-05 0.10505 0.9163 -1.56e-04 1.73e-04
#> 6 cyl 8 4.35e-09 9.89e-08 0.04393 0.9650 -1.90e-07 1.98e-07
#> 8 cyl 6 9.29e-06 7.98e-04 0.01164 0.9907 -1.56e-03 1.57e-03
#> 8 cyl 4 1.50e-04 7.97e-03 0.01878 0.9850 -1.55e-02 1.58e-02
#> 8 cyl 8 1.41e-02 4.66e-02 0.30317 0.7618 -7.72e-02 1.05e-01
#>
#> Results averaged over levels of: mpg, cyl
#> Columns: group, term, value, cyl, estimate, std.error, statistic, p.value, conf.low, conf.high
Let’s contrast the marginal means in the first outcome level when
cyl
equals 4 and 6. These marginal means are located in
rows 1 and 7 respectively:
lc <- rep(0, nrow(mm))
lc[1] <- -1
lc[7] <- 1
marginal_means(
mod,
type = "probs",
hypothesis = lc)
#>
#> Term Mean Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> custom -0.283 0.196 -1.44 0.149 -0.667 0.101
#>
#> Results averaged over levels of: mpg, cyl
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
This is indeed equal to the results we would have obtained manually:
2.828726e-01 - 3.678521e-01
#> [1] -0.0849795
Now let’s say we want to calculate a “contrast in contrasts”, that is, the outcome of a 3-step process:
- Contrast between
cyl=6
andcyl=4
in the 1st outcome level - Contrast between
cyl=6
andcyl=4
in the 2nd outcome level - Contrast between the contrasts defined in steps 1 and 2.
We create the linear combination weights as follows:
To make sure that the weights are correct, we can display them side
by side with the original marginal_means()
output:
transform(mm[, 1:3], lc = lc)
#> group term value lc
#> 1 1 cyl 6 -1
#> 2 1 cyl 4 1
#> 3 1 cyl 8 0
#> 4 2 cyl 6 0
#> 5 2 cyl 4 0
#> 6 2 cyl 8 0
#> 7 3 cyl 6 1
#> 8 3 cyl 4 -1
#> 9 3 cyl 8 0
#> 10 4 cyl 6 0
#> 11 4 cyl 4 0
#> 12 4 cyl 8 0
#> 13 6 cyl 6 0
#> 14 6 cyl 4 0
#> 15 6 cyl 8 0
#> 16 8 cyl 6 0
#> 17 8 cyl 4 0
#> 18 8 cyl 8 0
Compute the results:
marginal_means(mod, type = "probs", hypothesis = lc)
#>
#> Term Mean Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> custom 0.0843 0.321 0.263 0.793 -0.545 0.714
#>
#> Results averaged over levels of: mpg, cyl
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Pairwise contrasts: Difference-in-Differences
Now we illustrate how to use the machinery described above to do pairwise comparisons between contrasts, a type of analysis often associated with a “Difference-in-Differences” research design.
First, we simulate data with two treatment groups and pre/post periods:
library(data.table)
N <- 1000
did <- data.table(
id = 1:N,
pre = rnorm(N),
trt = sample(0:1, N, replace = TRUE))
did$post <- did$pre + did$trt * 0.3 + rnorm(N)
did <- melt(
did,
value.name = "y",
variable.name = "time",
id.vars = c("id", "trt"))
head(did)
#> id trt time y
#> 1: 1 1 pre -0.3684420
#> 2: 2 0 pre 0.7661930
#> 3: 3 0 pre -0.3778234
#> 4: 4 0 pre 1.5173513
#> 5: 5 1 pre -0.4112270
#> 6: 6 1 pre -1.2143283
Then, we estimate a linear model with a multiple interaction between the time and the treatment indicators. We also compute contrasts at the mean for each treatment level:
did_model <- lm(y ~ time * trt, data = did)
comparisons(
did_model,
newdata = datagrid(trt = 0:1),
variables = "time")
#>
#> Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % trt
#> time post - pre -0.0533 0.0779 -0.684 0.494 -0.206 0.0994 0
#> time post - pre 0.2760 0.0761 3.628 <0.001 0.127 0.4250 1
#>
#> Columns: rowid, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, y, time, trt
Finally, we compute pairwise differences between contrasts. This is the Diff-in-Diff estimate:
comparisons(
did_model,
variables = "time",
newdata = datagrid(trt = 0:1),
hypothesis = "pairwise")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> Row 1 - Row 2 -0.329 0.109 -3.02 0.00249 -0.543 -0.116
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Joint hyotheses tests
The hypotheses()
function can also test multiple
hypotheses jointly. For example, consider this model:
model <- lm(mpg ~ as.factor(cyl) * hp, data = mtcars)
coef(model)
#> (Intercept) as.factor(cyl)6 as.factor(cyl)8 hp as.factor(cyl)6:hp as.factor(cyl)8:hp
#> 35.98302564 -15.30917451 -17.90295193 -0.11277589 0.10516262 0.09853177
We may want to test the null hypothesis that two of the coefficients are jointly (both) equal to zero.
hypotheses(model, joint = c("as.factor(cyl)6:hp", "as.factor(cyl)8:hp"))
#>
#>
#> Joint hypothesis test:
#> as.factor(cyl)6:hp = 0
#> as.factor(cyl)8:hp = 0
#>
#> F Pr(>|F|) Df 1 Df 2
#> 2.11 0.142 2 26
#>
#> Columns: statistic, p.value, df1, df2
The joint
argument allows users to flexibly specify the
parameters to be tested, using character vectors, integer indices, or
Perl-compatible regular expressions. We can also specificy the null
hypothesis for each parameter individually using the
hypothesis
argument.
Naturally, the hypotheses
function also works with
marginaleffects
objects.
# joint hypotheses: regular expression
hypotheses(model, joint = "cyl")
#>
#>
#> Joint hypothesis test:
#> as.factor(cyl)6 = 0
#> as.factor(cyl)8 = 0
#> as.factor(cyl)6:hp = 0
#> as.factor(cyl)8:hp = 0
#>
#> F Pr(>|F|) Df 1 Df 2
#> 5.7 0.00197 4 26
#>
#> Columns: statistic, p.value, df1, df2
# joint hypotheses: integer indices
hypotheses(model, joint = 2:3)
#>
#>
#> Joint hypothesis test:
#> as.factor(cyl)6 = 0
#> as.factor(cyl)8 = 0
#>
#> F Pr(>|F|) Df 1 Df 2
#> 6.12 0.00665 2 26
#>
#> Columns: statistic, p.value, df1, df2
# joint hypotheses: different null hypotheses
hypotheses(model, joint = 2:3, hypothesis = 1)
#>
#>
#> Joint hypothesis test:
#> as.factor(cyl)6 = 1
#> as.factor(cyl)8 = 1
#>
#> F Pr(>|F|) Df 1 Df 2
#> 6.84 0.00411 2 26
#>
#> Columns: statistic, p.value, df1, df2
hypotheses(model, joint = 2:3, hypothesis = 1:2)
#>
#>
#> Joint hypothesis test:
#> as.factor(cyl)6 = 1
#> as.factor(cyl)8 = 2
#>
#> F Pr(>|F|) Df 1 Df 2
#> 7.47 0.00273 2 26
#>
#> Columns: statistic, p.value, df1, df2
# joint hypotheses: marginaleffects object
cmp <- avg_comparisons(model)
hypotheses(cmp, joint = "cyl")
#>
#>
#> Joint hypothesis test:
#> cyl 6 - 4 = 0
#> cyl 8 - 4 = 0
#>
#> F Pr(>|F|) Df 1 Df 2
#> 1.6 0.219 2 29
#>
#> Columns: statistic, p.value, df1, df2
We can also combine multiple calls to hypotheses
to
execute a joint test on linear combinations of coefficients:
# fit model
mod <- lm(mpg ~ factor(carb), mtcars)
# hypothesis matrix for linear combinations
H <- matrix(0, nrow = length(coef(mod)), ncol = 2)
H[2:3, 1] <- H[4:6, 2] <- 1
# test individual linear combinations
hyp <- hypotheses(mod, hypothesis = H)
hyp
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> custom -12.0 4.92 -2.44 0.01477 -21.6 -2.35
#> custom -25.5 9.03 -2.83 0.00466 -43.2 -7.85
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
# test joint hypotheses
hypotheses(hyp, joint = TRUE, hypothesis = c(-10, -20))
#>
#>
#> Joint hypothesis test:
#> b1 = -10
#> b2 = -20
#>
#> F Pr(>|F|) Df 1 Df 2
#> 0.197 0.822 2 30
#>
#> Columns: statistic, p.value, df1, df2
Equivalence, non-inferiority, and non-superiority tests
In many contexts, analysts are less interested in rejecting a null hypothesis, and more interested in testing whether an estimate is “inferior”, “superior”, or “equivalent” to a given threshold or interval. For example, medical researchers may wish to determine if the estimated effect of a new treatment is larger than the effect of prior treatments, or larger than some threshold of “clinical significance.” Alternatively, researchers may wish to support a claim that an estimated parameter is “equivalent to” or “not meaningfully different from” a null hypothesis.
To answer these questions, we can use non-inferiority, non-superiority, or equivalence tests like the two-one-sided test (TOST). This article gives a primer and tutorial on TOST:
Lakens D, Scheel AM, Isager PM. Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science. 2018;1(2):259-269. doi:10.1177/2515245918770963
The hypotheses()
function of the
marginaleffects
package includes an
equivalence
argument which allows users to apply these
tests to any of the quantities generated by the package, as well as to
arbitrary functions of a model’s parameters. To illustrate, we begin by
estimating a simple linear regression model:
The rest of this section considers several quantities estimated by
marginaleffects
.
Example 1: Predictions
Consider a single prediction, where all predictors are held at their median or mode:
p <- predictions(mod, newdata = "median")
p
#>
#> Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % hp gear
#> 19.7 1 19.6 <0.001 17.7 21.6 123 3
#>
#> Columns: rowid, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, hp, gear
Now we specify an equivalence interval (or “region”) for predictions between 17 and 18:
hypotheses(p, equivalence = c(17, 18))
#>
#> Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % hp gear p (NonInf) p (NonSup) p (Equiv)
#> 19.7 1 19.6 <0.001 17.7 21.6 123 3 0.00404 0.951 0.951
#>
#> Columns: rowid, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, hp, gear, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
The results allow us to draw three conclusions:
- The p value for the non-inferiority test is 0.0040. This suggests that we can reject the null hypothesis that the parameter is below 17.
- The p value for the non-superiority test is 0.9508. This suggests that we cannot reject the null hypothesis that the parameter (19.6589) is above 18.
- The p value for the equivalence test is 0.9508. This suggests that we cannot reject the hypothesis that the parameter falls outside the equivalence interval.
Example 2: Model coefficients
The hypotheses
function also allows users to conduct
equivalence, non-inferiority, and non-superiority tests for model
coefficients, and for arbitrary functions of model coefficients.
Our estimate of the 4th coefficient in the model is:
coef(mod)[4]
#> factor(gear)5
#> 6.574763
We can test if this parameter is likely to fall in the [5,7] interval by:
hypotheses(mod, equivalence = c(5, 7))[4, ]
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % p (NonInf) p (NonSup) p (Equiv)
#> b4 6.57 1.64 4 <0.001 3.36 9.79 0.169 0.398 0.398
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
The p value is 0.3979, so we cannot reject the hypothesis that the factor(gear)5 parameter falls outside the [5,7] interval.
Example 3: Slopes
The same syntax can be used to conduct tests for all the quantities
produced by the marginaleffects
package. For example,
imagine that, for substantive or theoretical reasons, an average slope
between -0.1 and 0.1 is uninteresting. We can conduct an equivalence
test to check if this is the case:
avg_slopes(mod, variables = "hp", equivalence = c(-.1, .1))
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % p (NonInf) p (NonSup) p (Equiv)
#> hp -0.0669 0.011 -6.05 <0.001 -0.0885 -0.0452 0.00135 <0.001 0.00135
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
The p value is 0.0013, which suggests that we can reject the hypothesis that the parameter falls outside the region of “substantive equivalence” that we have defined by the interval.
Example 4: Difference between comparisons (contrasts)
Consider a model with a multiplicative interaction:
The average contrast for a change of 1 unit in hp
differs based on the value of gear
:
avg_comparisons(int, variables = "hp", by = "gear")
#>
#> Term Contrast gear Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> hp mean(+1) 4 -0.1792 0.0303 -5.92 <0.001 -0.2385 -0.1199
#> hp mean(+1) 3 -0.0522 0.0146 -3.59 <0.001 -0.0808 -0.0237
#> hp mean(+1) 5 -0.0583 0.0126 -4.61 <0.001 -0.0830 -0.0335
#>
#> Columns: term, contrast, gear, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo
Are these contrasts different from one another? Let’s look at the pairwise differences between them:
avg_comparisons(int, variables = "hp", by = "gear",
hypothesis = "pairwise")
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> 4 - 3 -0.12695 0.0336 -3.781 <0.001 -0.1928 -0.0611
#> 4 - 5 -0.12092 0.0328 -3.688 <0.001 -0.1852 -0.0567
#> 3 - 5 0.00603 0.0193 0.313 0.754 -0.0318 0.0438
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
We consider that these pairwise comparisons are “equivalent to zero” when they fall in the [-.1, .1] interval:
avg_comparisons(int, variables = "hp", by = "gear",
hypothesis = "pairwise",
equivalence = c(-.1, .1))
#>
#> Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % p (NonInf) p (NonSup) p (Equiv)
#> 4 - 3 -0.12695 0.0336 -3.781 <0.001 -0.1928 -0.0611 0.789 <0.001 0.789
#> 4 - 5 -0.12092 0.0328 -3.688 <0.001 -0.1852 -0.0567 0.738 <0.001 0.738
#> 3 - 5 0.00603 0.0193 0.313 0.754 -0.0318 0.0438 <0.001 <0.001 <0.001
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
The p (Equiv)
column shows that the difference between
the average contrasts when gear
is 3 and gear
is 5 can be said to be equivalent to the specified interval. However,
there are good reasons to think that the other two pairwise comparisons
may fall outside the interval.
Example 5: Marginal means and emmeans
This example shows the equivalence between results produced by the
emmeans
package and the marginal_means()
function:
library(emmeans)
mod <- lm(log(conc) ~ source + factor(percent), data = pigs)
# {emmeans}
emmeans(mod, specs = "source") |>
pairs() |>
test(df = Inf,
null = 0,
delta = log(1.25),
side = "equivalence",
adjust = "none")
#> contrast estimate SE df z.ratio p.value
#> fish - soy -0.273 0.0529 Inf 0.937 0.8257
#> fish - skim -0.402 0.0542 Inf 3.308 0.9995
#> soy - skim -0.130 0.0530 Inf -1.765 0.0388
#>
#> Results are averaged over the levels of: percent
#> Degrees-of-freedom method: user-specified
#> Results are given on the log (not the response) scale.
#> Statistics are tests of equivalence with a threshold of 0.22314
#> P values are left-tailed
# {marginaleffects}
marginal_means(
mod,
variables = "source",
hypothesis = "pairwise",
equivalence = c(-log(1.25), log(1.25)))
#>
#> Term Mean Std. Error z Pr(>|z|) 2.5 % 97.5 % p (Equiv) p (NonInf) p (NonSup)
#> fish - soy -0.273 0.0529 -5.15 <0.001 -0.377 -0.1690 0.8257 0.8257 <0.001
#> fish - skim -0.402 0.0542 -7.43 <0.001 -0.508 -0.2961 0.9995 0.9995 <0.001
#> soy - skim -0.130 0.0530 -2.44 0.0146 -0.233 -0.0255 0.0388 0.0388 <0.001
#>
#> Results averaged over levels of: percent, source
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, p.value.equiv, p.value.noninf, p.value.nonsup, statistic.noninf, statistic.nonsup
Example 6: t-test
Now we show that the results produced by hypotheses()
are identical to the results produced by the equivalence
package in the case of a simple t-test:
library(equivalence)
set.seed(1024)
# simulate data data
N <- 20
dat <- data.frame(
y = rnorm(N),
x = sample(c(rep(0, N / 2), rep(1, N / 2)), N))
# fit model
mod <- lm(y ~ x, data = dat)
# test with the {equivalence} package
e <- tost(
x = dat$y[dat$x == 0],
y = dat$y[dat$x == 1],
epsilon = 10)
e
#>
#> Welch Two Sample TOST
#>
#> data: dat$y[dat$x == 0] and dat$y[dat$x == 1]
#> df = 17.607
#> sample estimates:
#> mean of x mean of y
#> -0.3788551 -0.2724594
#>
#> Epsilon: 10
#> 95 percent two one-sided confidence interval (TOST interval):
#> -1.058539 0.845747
#> Null hypothesis of statistical difference is: rejected
#> TOST p-value: 4.248528e-13
# test with {marginaleffects} package
hypotheses(mod, equivalence = c(-10, 10), df = e$parameter)[2, ]
#>
#> Term Estimate Std. Error t Pr(>|t|) 2.5 % 97.5 % Df p (NonInf) p (NonSup) p (Equiv)
#> b2 0.106 0.548 0.194 0.848 -1.05 1.26 17.6 <0.001 <0.001 <0.001
#>
#> Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high, df, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv