datagrid

Data grids

Description

Generate a data grid of user-specified values for use in the newdata argument of the predictions(), comparisons(), and slopes() functions. A grid defines where in the predictor space the quantities of interest are evaluated, for example the predicted outcome or slope for a 37-year-old college graduate.

Usage

datagrid(
  ...,
  model = NULL,
  newdata = NULL,
  by = NULL,
  grid_type = "mean_or_mode",
  response = FALSE,
  FUN_character = NULL,
  FUN_factor = NULL,
  FUN_logical = NULL,
  FUN_numeric = NULL,
  FUN_integer = NULL,
  FUN_binary = NULL,
  FUN_other = NULL
)

Arguments

`...`	named arguments with vectors of values or functions for user-specified variables. Functions are applied to the variable in the `model` dataset or `newdata`, and must return a vector of the appropriate type. Character vectors are automatically transformed to factors if necessary. The output will include all combinations of these variables (see Examples below).
`model`	Model object
`newdata`	data.frame (one and only one of the `model` and `newdata` arguments can be used.)
`by`	character vector with grouping variables within which `FUN_*` functions are applied to create "sub-grids" with unspecified variables.
`grid_type`	character. Determines the functions to apply to each variable. The defaults can be overridden by defining individual variables explicitly in `...`, or by supplying a function to one of the `FUN_*` arguments. "mean_or_mode": character, factor, logical, and binary variables are set to their modes; numeric, integer, and other variables are set to their means. "balanced": each unique level of character, factor, logical, and binary variables is preserved; numeric, integer, and other variables are set to their means. Warning: when there are many variables and many levels per variable, a balanced grid can be very large. In those cases, it is better to use `grid_type = "mean_or_mode"` and to specify the unique levels of a subset of named variables explicitly. "counterfactual": the entire dataset is duplicated for each combination of the variable values specified in `...`; variables not explicitly supplied to `datagrid()` are set to their observed values in the original dataset.
`response`	logical. Should the response variable be included in the grid, even if it is not specified explicitly?
`FUN_character`	the function to be applied to character variables.
`FUN_factor`	the function to be applied to factor variables. This only applies if the variable in the original data is a factor. For variables converted to factor in a model-fitting formula, for example, `FUN_character` is used.
`FUN_logical`	the function to be applied to logical variables.
`FUN_numeric`	the function to be applied to numeric variables.
`FUN_integer`	the function to be applied to integer variables.
`FUN_binary`	the function to be applied to binary variables.
`FUN_other`	the function to be applied to other variable types.
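
As a brief illustration of how the `FUN_*` arguments interact with explicitly named variables, the sketch below replaces the default mean with the median for numeric variables. It assumes the `marginaleffects` package is installed and uses only the arguments listed above; the object names `g_median` and `g_mixed` are arbitrary.

```r
library(marginaleffects)

# Override the default for numeric variables: medians instead of means.
# Binary variables such as `vs` and `am` are still handled by `FUN_binary`.
g_median <- datagrid(newdata = mtcars, FUN_numeric = median)

# Values named explicitly in `...` take precedence over `FUN_numeric`:
# `hp` takes the two requested values, the other numeric columns use the median.
g_mixed <- datagrid(newdata = mtcars, hp = c(100, 110), FUN_numeric = median)

g_median$hp
g_mixed[, c("hp", "wt")]
```

The same pattern should apply to the other `FUN_*` arguments, each of which targets one variable type.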

Details

If datagrid() is used as the newdata argument of a predictions(), comparisons(), or slopes() call, the model is automatically inserted in the model argument of the datagrid() call, so users do not need to specify either the model or newdata arguments. The same happens when the value supplied to newdata is a call to a function whose name starts with "datagrid". This allows users to create convenience shortcuts like:

library(marginaleffects)
mod <- lm(mpg ~ am + vs + factor(cyl) + hp, mtcars)
datagrid_bal <- function(...) datagrid(..., grid_type = "balanced")
predictions(mod, newdata = datagrid_bal(cyl = 4))

If users supply a model, the data used to fit that model is retrieved using the insight::get_data function.
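
To see what data a model-based grid is built from, one can call the retrieval function directly. This is a small sketch that assumes the `insight` package is installed; the exact set of columns returned may depend on the `insight` version, so the code only inspects the result rather than asserting a particular layout.

```r
library(insight)

mod <- lm(mpg ~ hp, mtcars)

# Retrieve the data used to fit the model
dat <- get_data(mod)

nrow(dat)   # number of observations used in the fit
names(dat)  # variables available for building the grid
```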

Value

A data.frame in which each row corresponds to one combination of the named predictors supplied by the user via the `...` dots. By default (`grid_type = "mean_or_mode"`), variables which are not explicitly defined are held at their mean or mode.
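
The structure of the returned grid can be approximated in base R. The sketch below is only a rough illustration of the "mean or mode" idea, not the package's implementation: `mode_value()` and `typical_grid()` are hypothetical helpers, and treating a numeric 0/1 column as binary is an assumption made for this example.

```r
# Most frequent value of a vector (hypothetical helper)
mode_value <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: all combinations of the user-supplied values,
# with every other column held at its mean (numeric) or mode (otherwise).
typical_grid <- function(data, ...) {
  user <- list(...)
  others <- setdiff(names(data), names(user))
  fixed <- lapply(data[others], function(x) {
    is_binary <- is.numeric(x) && all(x %in% c(0, 1))
    if (is.numeric(x) && !is_binary) mean(x) else mode_value(x)
  })
  # Each fixed variable has length 1, so only the user-supplied
  # variables contribute to the number of rows.
  expand.grid(c(fixed, user), stringsAsFactors = FALSE)
}

g <- typical_grid(mtcars, hp = c(100, 110), cyl = c(4, 6, 8))
nrow(g)
```

Here the grid has one row per combination of `hp` and `cyl`, and the remaining columns are constant across rows.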

Examples

library("marginaleffects")

# The output only has 2 rows, and all the variables except `hp` are at their
# mean or mode.
datagrid(newdata = mtcars, hp = c(100, 110))

       mpg    cyl     disp     drat      wt     qsec vs am   gear   carb  hp
1 20.09062 6.1875 230.7219 3.596563 3.21725 17.84875  0  0 3.6875 2.8125 100
2 20.09062 6.1875 230.7219 3.596563 3.21725 17.84875  0  0 3.6875 2.8125 110
  rowid
1     1
2     2

# Supplying a model instead of a data.frame works the same way. Here the grid
# only contains the variables used in the model.
mod <- lm(mpg ~ hp, mtcars)
datagrid(model = mod, hp = c(100, 110))

   hp rowid
1 100     1
2 110     2

# Use with `slopes()` to compute "Typical Marginal Effects". When `datagrid()` is
# called inside `slopes()` or `predictions()`, we do not need to specify the
# `model` or `newdata` arguments.
slopes(mod, newdata = datagrid(hp = c(100, 110)))


 Term  hp Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %
   hp 100  -0.0682     0.0101 -6.74   <0.001 35.9 -0.0881 -0.0484
   hp 110  -0.0682     0.0101 -6.74   <0.001 35.9 -0.0881 -0.0484

Type:  response 
Columns: rowid, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, hp, predicted_lo, predicted_hi, predicted, mpg

# datagrid accepts functions
datagrid(hp = range, cyl = unique, newdata = mtcars)

       mpg     disp     drat      wt     qsec vs am   gear   carb  hp cyl rowid
1 20.09062 230.7219 3.596563 3.21725 17.84875  0  0 3.6875 2.8125  52   6     1
2 20.09062 230.7219 3.596563 3.21725 17.84875  0  0 3.6875 2.8125  52   4     2
3 20.09062 230.7219 3.596563 3.21725 17.84875  0  0 3.6875 2.8125  52   8     3
4 20.09062 230.7219 3.596563 3.21725 17.84875  0  0 3.6875 2.8125 335   6     4
5 20.09062 230.7219 3.596563 3.21725 17.84875  0  0 3.6875 2.8125 335   4     5
6 20.09062 230.7219 3.596563 3.21725 17.84875  0  0 3.6875 2.8125 335   8     6

comparisons(mod, newdata = datagrid(hp = fivenum))


 Term  hp Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %
   hp  52  -0.0682     0.0101 -6.74   <0.001 35.9 -0.0881 -0.0484
   hp  96  -0.0682     0.0101 -6.74   <0.001 35.9 -0.0881 -0.0484
   hp 123  -0.0682     0.0101 -6.74   <0.001 35.9 -0.0881 -0.0484
   hp 180  -0.0682     0.0101 -6.74   <0.001 35.9 -0.0881 -0.0484
   hp 335  -0.0682     0.0101 -6.74   <0.001 35.9 -0.0881 -0.0484

Type:  response 
Comparison: +1
Columns: rowid, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, hp, predicted_lo, predicted_hi, predicted, mpg

# The full dataset is duplicated with each observation given counterfactual
# values of 100 and 110 for the `hp` variable. The original `mtcars` includes
# 32 rows, so the resulting dataset includes 64 rows.
dg <- datagrid(newdata = mtcars, hp = c(100, 110), grid_type = "counterfactual")
nrow(dg)

[1] 64

# We get the same result by feeding a model instead of a data.frame
mod <- lm(mpg ~ hp, mtcars)
dg <- datagrid(model = mod, hp = c(100, 110), grid_type = "counterfactual")
nrow(dg)

[1] 64
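
The row count of a counterfactual grid follows from a cross join between the observed data and the user-supplied values. The base R sketch below illustrates that logic only; `counterfactual_grid()` and the `obs_id` column are hypothetical names for this example and do not reflect how the package builds or labels its output.

```r
# Hypothetical helper: duplicate every observed row once per combination
# of the user-supplied values, keeping all other columns as observed.
counterfactual_grid <- function(data, ...) {
  user <- expand.grid(list(...), stringsAsFactors = FALSE)
  observed <- data[setdiff(names(data), names(user))]
  observed$obs_id <- seq_len(nrow(observed))  # tracks the original row
  # `by = NULL` makes merge() return the Cartesian product
  merge(observed, user, by = NULL)
}

cf <- counterfactual_grid(mtcars, hp = c(100, 110))
nrow(cf)
table(cf$hp)
```

With 32 observations and 2 counterfactual values of `hp`, the cross join yields 32 x 2 = 64 rows, matching the count reported above.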