| Type: | Package | 
| Title: | Simulation-Based Inference for Regression Models | 
| Version: | 0.2.2 | 
| Description: | Performs simulation-based inference as an alternative to the delta method for obtaining valid confidence intervals and p-values for regression post-estimation quantities, such as average marginal effects and predictions at representative values. This framework for simulation-based inference is especially useful when the resulting quantity is not normally distributed and the delta method approximation fails. The methodology is described in Greifer, et al. (2025) <doi:10.32614/RJ-2024-015>. 'clarify' is meant to replace some of the functionality of the archived package 'Zelig'; see the vignette "Translating Zelig to clarify" for replicating this functionality. | 
| License: | GPL (≥ 3) | 
| Encoding: | UTF-8 | 
| Depends: | R (≥ 4.1.0) | 
| Imports: | ggplot2 (≥ 3.5.0), pbapply (≥ 1.7-4), chk (≥ 0.10.0), rlang (≥ 1.1.6), insight (≥ 1.4.2), marginaleffects (≥ 0.30.0) | 
| Suggests: | testthat (≥ 3.0.0), withr, MatchIt (≥ 4.0.0), parallel, knitr, rmarkdown, Amelia, MASS, betareg, survey, estimatr, fixest, logistf, geepack, rms, robustbase, robust, AER, ivreg, mgcv, sandwich, WeightIt (≥ 1.3.2) | 
| Config/testthat/edition: | 3 | 
| RoxygenNote: | 7.3.3 | 
| URL: | https://github.com/iqss/clarify, https://iqss.github.io/clarify/ | 
| BugReports: | https://github.com/iqss/clarify/issues | 
| VignetteBuilder: | knitr | 
| NeedsCompilation: | no | 
| Packaged: | 2025-09-19 17:01:29 UTC; NoahGreifer | 
| Author: | Noah Greifer | 
| Maintainer: | Noah Greifer <ngreifer@iq.harvard.edu> | 
| Repository: | CRAN | 
| Date/Publication: | 2025-09-19 23:40:14 UTC | 
clarify: Simulation-Based Inference for Regression Models
Description
Performs simulation-based inference as an alternative to the delta method for obtaining valid confidence intervals and p-values for regression post-estimation quantities, such as average marginal effects and predictions at representative values. This framework for simulation-based inference is especially useful when the resulting quantity is not normally distributed and the delta method approximation fails. The methodology is described in Greifer, et al. (2025) doi:10.32614/RJ-2024-015. 'clarify' is meant to replace some of the functionality of the archived package 'Zelig'; see the vignette "Translating Zelig to clarify" for replicating this functionality.
Author(s)
Maintainer: Noah Greifer ngreifer@iq.harvard.edu (ORCID)
Authors:
- Steven Worthington sworthington@iq.harvard.edu (ORCID) 
- Stefano Iacus siacus@iq.harvard.edu (ORCID) 
- Gary King king@harvard.edu (ORCID) 
References
Greifer, N., Worthington, S., Iacus, S., & King, G. (2025). clarify: Simulation-Based Inference for Regression Models. The R Journal 16(2), 154–174. doi:10.32614/RJ-2024-015
See Also
Useful links:
- Report bugs at https://github.com/iqss/clarify/issues 
Simulate model coefficients after multiple imputation
Description
misim() simulates model parameters from multivariate normal or t distributions after multiple imputation that are then used by sim_apply() to calculate quantities of interest.
Usage
misim(fitlist, n = 1000, vcov = NULL, coefs = NULL, dist = NULL)
Arguments
| fitlist | a list of model fits, one for each imputed dataset, or a  | 
| n | the number of simulations to run for each imputed dataset; default is 1000. More is always better but resulting calculations will take longer. | 
| vcov | a square covariance matrix of the coefficient covariance estimates, a function to use to extract it from  | 
| coefs | a vector of coefficient estimates, a function to use to extract it from  | 
| dist | a character vector containing the name of the multivariate distribution(s) to use to draw simulated coefficients. Should be one of  | 
Details
misim() essentially combines multiple sim() calls applied to a list of model fits, each fit in an imputed dataset, into a single combined pool of simulated coefficients. When simulation-based inference is to be used with multiply imputed data, many imputations are required; see Zhou and Reiter (2010).
Value
A clarify_misim object, which inherits from clarify_sim and has the following components:
| sim.coefs | a matrix containing the simulated coefficients with a column for each coefficient and a row for each simulation for each imputation | 
| coefs | a matrix containing the original coefficients extracted from  | 
| fit | the list of model fits supplied to  | 
| imp | a identifier of which imputed dataset each set of simulated coefficients corresponds to. | 
The "dist" attribute contains "normal" if the coefficients were sampled from a multivariate normal distribution and "t({df})" if sampled from a multivariate t distribution. The "clarify_hash" attribute contains a unique hash generated by rlang::hash().
References
Zhou, X., & Reiter, J. P. (2010). A Note on Bayesian Inference After Multiple Imputation. The American Statistician, 64(2), 159–163. doi:10.1198/tast.2010.09109
See Also
-  sim()for simulating model coefficients for a single dataset
-  sim_apply()for applying a function to each set of simulated coefficients
-  sim_ame()for computing average marginal effects in each simulation draw
-  sim_setx()for computing marginal predictions and first differences at typical values in each simulation draw
Examples
data("africa", package = "Amelia")
# Multiple imputation using Amelia
a.out <- Amelia::amelia(x = africa, m = 10,
                        cs = "country",
                        ts = "year", logs = "gdp_pc",
                        p2s = 0)
fits <- with(a.out, lm(gdp_pc ~ infl * trade))
# Simulate coefficients
s <- misim(fits)
s
Plot marginal predictions from sim_adrf()
Description
plot.clarify_adrf() plots the output of sim_adrf(). For the average dose-response function (ADRF, requested with contrast = "adrf" in sim_adrf()), this is a plot of the average marginal mean of the outcome against the requested values of the focal predictor; for the average marginal effects function (AMEF, requested with contrast = "amef" in sim_adrf()), this is a plot of the instantaneous average marginal effect of the focal predictor on the outcome against the requested values of the focal predictor.
Usage
## S3 method for class 'clarify_adrf'
plot(
  x,
  ci = TRUE,
  level = 0.95,
  method = "quantile",
  baseline = NULL,
  color = "black",
  simultaneous = FALSE,
  ...
)
Arguments
| x | a  | 
| ci | 
 | 
| level | the confidence level desired. Default is .95 for 95% confidence intervals. | 
| method | the method used to compute confidence bands. Can be  | 
| baseline | 
 | 
| color | the color of the line and confidence band in the plot. | 
| simultaneous | 
 | 
| ... | for  | 
Details
These plots are produced using ggplot2::geom_line() and ggplot2::geom_ribbon(). The confidence bands should be interpreted pointwise (i.e., they do not account for simultaneous inference) unless simultaneous = TRUE.
Value
A ggplot object.
See Also
summary.clarify_est() for computing p-values and confidence intervals for the estimated quantities.
Examples
## See help("sim_adrf") for examples
Plotting and inference for clarify_est objects
Description
summary() tabulates the estimates and confidence intervals and (optionally) p-values from a clarify_est object. confint() computes confidence intervals. plot() plots the "posterior" distribution of estimates.
Usage
## S3 method for class 'clarify_est'
plot(
  x,
  parm,
  ci = TRUE,
  level = 0.95,
  method = "quantile",
  reference = FALSE,
  ncol = 3L,
  simultaneous = FALSE,
  ...
)
## S3 method for class 'clarify_est'
summary(
  object,
  parm,
  level = 0.95,
  method = "quantile",
  null = NA,
  simultaneous = FALSE,
  ...
)
## S3 method for class 'clarify_est'
confint(
  object,
  parm,
  level = 0.95,
  method = "quantile",
  simultaneous = FALSE,
  ...
)
Arguments
| parm | a vector of the names or indices of the estimates to plot. If unspecified, all estimates will be displayed. | 
| ci | 
 | 
| level | the confidence level desired. Default is .95 for 95% confidence intervals. | 
| method | the method used to compute p-values and confidence intervals. Can be  | 
| reference | 
 | 
| ncol | the number of columns used when wrapping multiple plots; default is 3. | 
| simultaneous | 
 | 
| ... | for  | 
| object,x | a  | 
| null | the values of the parameters under the null hypothesis for the p-value calculations. Should have length equal to the number of quantities estimated, or one, in which case it will be recycled, or it can be a named vector with just the names of quantities for which null values are to be set. Set values to  | 
Details
summary() uses the estimates computed from the original model as its estimates and uses the simulated parameters for inference only, in line with the recommendations of Rainey (2023).
When method = "wald", the standard deviation of the simulation estimates is used as the standard error, which is used in the z-statistics and the confidence intervals. The p-values and confidence intervals are valid only when the sampling distribution of the resulting statistic is normal (which can be assessed using plot()). When method = "quantile", the confidence interval is calculated using the quantiles of the simulation estimates corresponding to level, and the p-value is calculated as twice the proportion of simulation estimates less than or greater than null, whichever is smaller; this is equivalent to inverting the confidence interval but is only truly valid when the true sampling distribution is only a location shift from the sampling distribution under the null hypothesis and should therefore be interpreted with caution. Using "method = "quantile" (the default) is recommended because the confidence intervals will be valid even if the sampling distribution is not Normally distributed. The precision of the p-values and confidence intervals depends on the number of simulations requested (the value of n supplied to sim()).
When simultaneous = TRUE, confidence intervals and p-values are adjusted to account for multiple comparisons using the "sup-t" confidence region and its inversion. The sup-t confidence region is the smallest rectangular region that ensures simultaneous coverage of all parameters at the desired confidence level. It is found by adjusting the nominal confidence level until a new level is found that, when simultaneously applied to all estimates, yields intervals that contain all estimates with rates equal to the original level. Unlike some other adjustments for multiple comparisons (e.g., Holm, Bonferroni, Benjamini-Hochberg), this method takes into account the joint distribution of the estimates, often yielding narrower regions that are less conservative (but still valid). P-values are found by finding the level of the narrowest band that guarantees simultaneous coverage while containing the null value of the given parameter. When method = "quantile", the Bayesian algorithm described by Montiel Olea and Plagborg-Møller (2019) is used with the simulated estimates; when method = "wald", a new simulation is performed treating the estimates as coming from a multivariate normal distribution.
The plots are produced using ggplot2::geom_density() and can be customized with ggplot2 functions. When reference = TRUE, a reference Normal distribution is produced using the empirical mean and standard deviation of the simulated values. A blue references line is plotted at the median of the simulated values. For Wald-based inference to be valid, the reference distribution should overlap with the empirical distribution, in which case the quantile-based and Wald-based intervals should be similar. For quantile-based inference to be valid, the median of the estimates should overlap with the estimated value; this is a necessary but not sufficient condition, though.
Value
For summary(), a summary.clarify_est object, which is a matrix containing the coefficient estimates, standard errors, test statistics, p-values, and confidence intervals. Not all columns will be present depending on the arguments supplied to summary().
For confint(), a matrix containing the confidence intervals for the requested quantities.
For plot(), a ggplot object.
References
Montiel Olea, J. L., & Plagborg-Møller, M. (2019). Simultaneous confidence bands: Theory, implementation, and an application to SVARs. Journal of Applied Econometrics, 34(1), 1–17. doi:10.1002/jae.2656
Rainey, C. (2023). A careful consideration of CLARIFY: Simulation-induced bias in point estimates of quantities of interest. Political Science Research and Methods, 1–10. doi:10.1017/psrm.2023.8
See Also
-  sim_apply()for applying a function to each set of simulated coefficients
Examples
data("lalonde", package = "MatchIt")
fit <- glm(I(re78 > 0) ~ treat + age + race + nodegree + re74,
          data = lalonde)
s <- sim(fit, n = 100)
# Compute average marginal means for `treat`
est <- sim_ame(s, var = "treat", verbose = FALSE)
coef(est)
# Compute average marginal effects on risk difference
# (RD) and risk ratio (RR) scale
est <- transform(est,
                 RD = `E[Y(1)]` - `E[Y(0)]`,
                 RR = `E[Y(1)]` / `E[Y(0)]`)
# Compute confidence intervals and p-values,
# using given null values for computing p-values
summary(est, null = c(`RD` = 0, `RR` = 1))
# Same tests using normal approximation and alternate
# syntax for `null`
summary(est, null = c(NA, NA, 0, 1),
        normal = TRUE)
# Plot the RD and RR with a reference distribution
plot(est, parm = c("RD", "RR"), reference = TRUE,
     ci = FALSE)
# Plot the RD and RR with quantile confidence bounds
plot(est, parm = c("RD", "RR"), ci = TRUE)
Plot marginal predictions from sim_setx()
Description
plot.clarify_sext() plots the output of sim_setx(), providing graphics similar to those of plot.clarify_est() but with features specifically for plot marginal predictions. For continues predictors, this is a plot of the marginal predictions and their confidence bands across levels of the predictor. Otherwise, this is is a plot of simulated sampling distribution of the marginal predictions.
Usage
## S3 method for class 'clarify_setx'
plot(
  x,
  var = NULL,
  ci = TRUE,
  level = 0.95,
  method = "quantile",
  reference = FALSE,
  simultaneous = FALSE,
  ...
)
Arguments
| x | a  | 
| var | the name of the focal varying predictor, i.e., the variable to be on the x-axis of the plot. All other variables with varying set values will be used to color the resulting plot. See Details. Ignored if no predictors vary or if only one predictor varies in the reference grid or if  | 
| ci | 
 | 
| level | the confidence level desired. Default is .95 for 95% confidence intervals. | 
| method | the method used to compute confidence intervals or bands. Can be  | 
| reference | 
 | 
| simultaneous | 
 | 
| ... | for  | 
Details
plot() creates one of two kinds of plots depending on how the reference grid was specified in the call to sim_setx() and what var is set to. When the focal varying predictor (i.e., the one set in var) is numeric and takes on three or more unique values in the reference grid, the produced plot is a line graph displaying the value of the marginal prediction (denoted as E[Y|X]) across values of the focal varying predictor, with confidence bands displayed when ci = TRUE. If other predictors also vary, lines for different values will be displayed in different colors. These plots are produced using ggplot2::geom_line() and ggplot2::geom_ribbon()
When the focal varying predictor is a factor or character or only takes on two or fewer values in the reference grid, the produced plot is a density plot of the simulated predictions, similar to the plot resulting from plot.clarify_est(). When other variables vary, densities for different values will be displayed in different colors. These plots are produced using ggplot2::geom_density().
Marginal predictions are identified by the corresponding levels of the predictors that vary. The user should keep track of whether the non-varying predictors are set at specified or automatically set "typical" levels.
Value
A ggplot object.
See Also
summary.clarify_est() for computing p-values and confidence intervals for the estimated quantities.
Examples
## See help("sim_setx") for examples
Simulate model parameters
Description
sim() simulates model parameters from a multivariate normal or t distribution that are then used by sim_apply() to calculate quantities of interest.
Usage
sim(fit, n = 1000L, vcov = NULL, coefs = NULL, dist = NULL)
Arguments
| fit | a model fit, such as the output of a call to  | 
| n | the number of simulations to run; default is 1000. More is always better but resulting calculations will take longer. | 
| vcov | either a square covariance matrix of the coefficient covariance estimates or a function to use to extract it from  | 
| coefs | either a vector of coefficient estimates or a function to use to extract it from  | 
| dist | a string containing the name of the multivariate distribution to use to draw simulated coefficients. Should be one of  | 
Details
When dist is NULL, sim() samples from a multivariate normal or t distribution depending on the degrees of freedom extracted from insight::get_df(., type = "wald"). If Inf, a normal distribution will be used; otherwise, a t-distribution with the returned degrees of freedom will be used. Models not supported by insight will use a normal distribution.
Value
A clarify_sim object, which has the following components:
| sim.coefs | a matrix containing the simulated coefficients with a column for each coefficient and a row for each simulation | 
| coefs | the original coefficients extracted from  | 
| vcov | the covariance matrix of the coefficients extracted from  | 
| fit | the original model fit supplied to  | 
The "dist" attribute contains "normal" if the coefficients were sampled from a multivariate normal distribution and "t(df)" if sampled from a multivariate t distribution. The "clarify_hash" attribute contains a unique hash generated by rlang::hash().
See Also
-  misim()for simulating model coefficients after multiple imputation
-  sim_apply()for applying a function to each set of simulated coefficients
-  sim_ame()for computing average marginal effects in each simulation draw
-  sim_setx()for computing marginal predictions and first differences at typical values in each simulation draw
-  sim_adrf()for computing average dose-response functions in each simulation draw
Examples
data("lalonde", package = "MatchIt")
fit <- lm(re78 ~ treat * (age + race + nodegree + re74),
          data = lalonde)
# Simulate coefficients
s <- sim(fit)
s
## Could also use a robust covariance matrix, e.g.,
s <- sim(fit, vcov = "HC3")
# Simulated coefficients assuming a normal distribution
# for coefficients; default for `lm` objects is a t-
# distribution
s <- sim(fit, dist = "normal")
s
Compute an average dose-response function
Description
sim_adrf() is a wrapper for sim_apply() that computes average dose-response functions (ADRFs) and average marginal effect functions (AMEFs). An ADRF describes the relationship between values a focal variable can take and the expected value of the outcome were all units to be given each value of the variable. An AMEF describes the relationship between values a focal variable can take and the derivative of ADRF at each value.
Usage
sim_adrf(
  sim,
  var,
  subset = NULL,
  by = NULL,
  contrast = "adrf",
  at = NULL,
  n = 21,
  outcome = NULL,
  type = NULL,
  eps = 1e-05,
  verbose = TRUE,
  cl = NULL,
  ...
)
## S3 method for class 'clarify_adrf'
print(x, digits = 4L, max.ests = 6L, ...)
Arguments
| sim | a  | 
| var | the name of a variable for which the ADRF or AMEF is to be computed. This variable must be present in the model supplied to  | 
| subset | optional; a vector used to subset the data used to compute the ADRF or AMEF. This will be evaluated within the original dataset used to fit the model using  | 
| by | a one-sided formula or character vector containing the names of variables for which to stratify the estimates. Each quantity will be computed within each level of the complete cross of the variables specified in  | 
| contrast | a string naming the type of quantity to be produced:  | 
| at | the levels of the variable named in  | 
| n | when  | 
| outcome | a string containing the name of the outcome or outcome level for multivariate (multiple outcomes) or multi-category outcomes. Ignored for univariate (single outcome) and binary outcomes. | 
| type | a string containing the type of predicted values (e.g., the link or the response). Passed to  | 
| eps | when  | 
| verbose | 
 | 
| cl | a cluster object created by  | 
| ... | for  | 
| x | a  | 
| digits | the minimum number of significant digits to be used; passed to  | 
| max.ests | the maximum number of estimates to display. | 
Details
The ADRF is composed of average marginal means across levels of the focal predictor. For each level of the focal predictor, predicted values of the outcome are computed after setting the value of the predictor to that level, and those values of the outcome are averaged across all units in the sample to arrive at an average marginal mean. Thus, the ADRF represent the relationship between the "dose" (i.e., the level of the focal predictor) and the average "response" (i.e., the outcome variable). It is the continuous analog to the average marginal effect computed for a binary predictor, e.g., using sim_ame(). Although inference can be at each level of the predictor or between two levels of the predictor, typically a plot of the ADRF is the most useful relevant quantity. These can be requested using plot.clarify_adrf().
The AMEF is the derivative of the ADRF; if we call the derivative of the ADRF at each point a "treatment effect" (i.e., the rate at which the outcome changes corresponding to a small change in the predictor, or "treatment"), the AMEF is a function that relates the size of the treatment effect to the level of the treatment. The shape of the AMEF is usually of less importance than the value of the AMEF at each level of the predictor, which corresponds to the size of the treatment effect at the corresponding level. The AMEF is computed by computing the ADRF at each level of the focal predictor specified in at, shifting the predictor value by a tiny amount (control by eps), and computing the ratio of the change in the outcome to the shift, then averaging this value across all units. This quantity is related the the average marginal effect of a continuous predictor as computed by sim_ame(), but rather than average these treatment effects across all observed levels of the treatment, the AMEF is a function evaluated at each possible level of the treatment. The "tiny amount" used is eps times the standard deviation of var.
Value
A clarify_adrf object, which inherits from clarify_est and is similar to
the output of sim_apply(), with the additional attributes "var" containing
the variable named in var, "by" containing the names of the variables specified in by (if any), "at" containing values at which the ADRF or AMEF is evaluated, and "contrast" containing the argument supplied to contrast. For an ADRF, the average marginal means will be named
E[Y({v})], where {v} is replaced with the values in at. For an AMEF, the average marginal effects will be
named dY/d({x})|{a} where {x} is replaced with var and {a} is replaced by the values in at.
See Also
plot.clarify_adrf() for plotting the ADRF or AMEF; sim_ame() for computing average marginal effects; sim_apply(), which provides a general interface to computing any quantities for simulation-based inference; summary.clarify_est() for computing p-values and confidence intervals for the estimated quantities.
marginaleffects::avg_slopes() and marginaleffects::avg_predictions() for delta method-based implementations of computing average marginal effects and average marginal means.
Examples
data("lalonde", package = "MatchIt")
# Fit the model
fit <- glm(I(re78 > 0) ~ treat + age + race +
             married + re74,
           data = lalonde, family = binomial)
# Simulate coefficients
set.seed(123)
s <- sim(fit, n = 100)
# ADRF for `age`
est <- sim_adrf(s, var = "age",
                at = seq(15, 55, length.out = 6),
                verbose = FALSE)
est
plot(est)
# AMEF for `age`
est <- sim_adrf(s, var = "age", contrast = "amef",
               at = seq(15, 55, length.out = 6),
               verbose = FALSE)
est
summary(est)
plot(est)
# ADRF for `age` within levels of `married`
est <- sim_adrf(s, var = "age",
                at = seq(15, 55, length.out = 6),
                by = ~married,
                verbose = FALSE)
est
plot(est)
## Difference between ADRFs
est_diff <- est[7:12] - est[1:6]
plot(est_diff) + ggplot2::labs(y = "Diff")
Compute average marginal effects
Description
sim_ame() is a wrapper for sim_apply() that computes average marginal effects, the average effect of changing a single variable from one value to another (i.e., from one category to another for categorical variables or a tiny change for continuous variables).
Usage
sim_ame(
  sim,
  var,
  subset = NULL,
  by = NULL,
  contrast = NULL,
  outcome = NULL,
  type = NULL,
  eps = 1e-05,
  verbose = TRUE,
  cl = NULL,
  ...
)
## S3 method for class 'clarify_ame'
print(x, digits = 4L, max.ests = 6L, ...)
Arguments
| sim | a  | 
| var | either the names of the variables for which marginal effects are to be computed or a named list containing the values the variables should take. See Details. | 
| subset | optional; a vector used to subset the data used to compute the marginal effects. This will be evaluated within the original dataset used to fit the model using  | 
| by | a one-sided formula or character vector containing the names of
variables for which to stratify the estimates. Each quantity will be
computed within each level of the complete cross of the variables specified
in  | 
| contrast | a string containing the name of a contrast between the
average marginal means when the variable named in  | 
| outcome | a string containing the name of the outcome or outcome level for multivariate (multiple outcomes) or multi-category outcomes. Ignored for univariate (single outcome) and binary outcomes. | 
| type | a string containing the type of predicted values (e.g., the link
or the response). Passed to  | 
| eps | when the variable named in  | 
| verbose | 
 | 
| cl | a cluster object created by  | 
| ... | for  | 
| x | a  | 
| digits | the minimum number of significant digits to be used; passed to  | 
| max.ests | the maximum number of estimates to display. | 
Details
sim_ame() computes average adjusted predictions or average marginal effects depending on which variables are named in var and how they are specified. Canonically, var should be specified as a named list with the value(s) each variable should be set to. For example, specifying var = list(x1 = 0:1) computes average adjusted predictions setting x1 to 0 and 1. Specifying a variable's values as NULL, e.g., list(x1 = NULL), is equivalent to requesting average adjusted predictions at each unique value of the variable when that variable is binary or a factor or character and requests the average marginal effect of that variable otherwise. Specifying an unnamed entry in the list with a string containing the value of that variable, e.g., list("x1") is equivalent to specifying list(x1 = NULL). Similarly, supplying a vector with the names of the variables is equivalent to specifying a list, e.g., var = "x1" is equivalent to var = list(x1 = NULL).
Multiple variables can be supplied to var at the same time to set the corresponding variables to those values. If all values are specified directly or the variables are categorical, e.g., list(x1 = 0:1, x2 = c(5, 10)), this computes average adjusted predictions at each combination of the supplied variables. If any one variable's values is specified as NULL and the variable is continuous, the average marginal effect of that variable will be computed with the other variables set to their corresponding combinations. For example, if x2 is a continuous variable, specifying var = list(x1 = 0:1, x2 = NULL) requests the average marginal effect of x2 computed first setting x1 to 0 and then setting x1 to 1. The average marginal effect can only be computed for one variable at a time.
Below are some examples of specifications and what they request, assuming x1 is a binary variable taking on values of 0 and 1 and x2 is a continuous variable:
-  list(x1 = 0:1),list(x1 = NULL),list("x1"),"x1"– the average adjusted predictions settingx1to 0 and to 1
-  list(x2 = NULL),list("x2"),"x2"– the average marginal effect ofx2
-  list(x2 = c(5, 10))– the average adjusted predictions settingx2to 5 and to 10
-  list(x1 = 0:1, x2 = c(5, 10)),list("x1", x2 = c(5, 10))– the average adjusted predictions settingx1andx2in a full cross of 0, 1 and 5, 10, respectively (e.g., (0, 5), (0, 10), (1, 5), and (1, 10))
-  list(x1 = 0:1, "x2"),list("x1", "x2"),c("x1", "x2")– the average marginal effects ofx2settingx1to 0 and to 1
The average adjusted prediction is the average predicted outcome
value after setting all units' value of a variable to a specified level. (This quantity
has several names, including the average potential outcome, average marginal mean, and standardized mean). When exactly two average adjusted predictions are requested, a contrast
between them can be requested by supplying an argument
to contrast (see Effect Measures section below). Contrasts can be manually computed using transform()
afterward as well; this is required when multiple average adjusted predictions are requested (i.e., because a single variable was supplied to var with more than two levels or a combination of multiple variables was supplied).
A marginal effect is the instantaneous rate of change
corresponding to changing a unit's observed value of a variable by a tiny amount
and considering to what degree the predicted outcome changes. The ratio of
the change in the predicted outcome to the change in the value of the variable is
the marginal effect; these are averaged across the sample to arrive at an
average marginal effect. The "tiny amount" used is eps times the standard
deviation of the focal variable.
The difference between using by or subset vs. var is that by and subset subset the data when computing the requested quantity, whereas var sets the corresponding variable to given a value for all units. For example, using by = ~v computes the quantity of interest separately for each subset of the data defined by v, whereas setting var = list(., "v") computes the quantity of interest for all units setting their value of v to its unique values. The resulting quantities have different interpretations. Both by and var can be used simultaneously.
Effect measures
The effect measures specified in contrast are defined below. Typically only
"diff" is appropriate for continuous outcomes and "diff" or "irr" are
appropriate for count outcomes; the rest are appropriate for binary outcomes.
For a focal variable with two levels, 0 and 1, and an outcome Y, the
average marginal means will be denoted in the below formulas as E[Y(0)] and
E[Y(1)], respectively.
| contrast | Description | Formula | 
| "diff"/"rd" | Mean/risk difference | E[Y(1)] - E[Y(0)] | 
| "rr"/"irr" | Risk ratio/incidence rate ratio | E[Y(1)] / E[Y(0)] | 
| "sr" | Survival ratio | (1 - E[Y(1)]) / (1 - E[Y(0)]) | 
| "srr"/"grrr" | Switch risk ratio | 1 - srifE[Y(1)] > E[Y(0)] | 
| rr - 1ifE[Y(1)] < E[Y(0)] | ||
| 0otherwise | ||
| "or" | Odds ratio | O[Y(1)] / O[Y(0)] | 
| where O[Y(.)]=E[Y(.)] / (1 - E[Y(.)]) | ||
| "nnt" | Number needed to treat | 1 / rd | 
The log(.) versions are defined by taking the log() (natural log) of the
corresponding effect measure.
Value
A clarify_ame object, which inherits from clarify_est and is
similar to the output of sim_apply(), with the additional attributes
"var" containing the variable values specified in var and "by" containing the
names of the variables specified in by (if any). The average adjusted
predictions will be named E[Y({v})], where {v} is replaced with the
values the variables named in var take on. The average marginal effect for a
continuous var will be named E[dY/d({x})] where {x} is replaced with
var. When by is specified, the average adjusted predictions will be named
E[Y({v})|{b}] and the average marginal effect E[dY/d({x})|{b}] where
{b} is a comma-separated list of of values of the by variables at which
the quantity is computed. See examples.
See Also
sim_apply(), which provides a general interface to computing any quantities for simulation-based inference; plot.clarify_est() for plotting the output of a call to sim_ame(); summary.clarify_est() for computing p-values and confidence intervals for the estimated quantities.
marginaleffects::avg_predictions(), marginaleffects::avg_comparisons() and marginaleffects::avg_slopes() for delta method-based implementations of computing average marginal effects.
Examples
data("lalonde", package = "MatchIt")
# Fit the model
fit <- glm(I(re78 > 0) ~ treat + age + race +
             married + re74,
           data = lalonde, family = binomial)
# Simulate coefficients
set.seed(123)
s <- sim(fit, n = 100)
# Average marginal effect of `age`
est <- sim_ame(s, var = "age", verbose = FALSE)
summary(est)
# Contrast between average adjusted predictions
# for `treat`
est <- sim_ame(s, var = "treat", contrast = "rr",
               verbose = FALSE)
summary(est)
# Average adjusted predictions for `race`; need to follow up
# with contrasts for specific levels
est <- sim_ame(s, var = "race", verbose = FALSE)
est <- transform(est,
                 `RR(h,b)` = `E[Y(hispan)]` / `E[Y(black)]`)
summary(est)
# Average adjusted predictions for `treat` within levels of
# `married`, first using `subset` and then using `by`
est0 <- sim_ame(s, var = "treat", subset = married == 0,
                contrast = "rd", verbose = FALSE)
names(est0) <- paste0(names(est0), "|married=0")
est1 <- sim_ame(s, var = "treat", subset = married == 1,
                contrast = "rd", verbose = FALSE)
names(est1) <- paste0(names(est1), "|married=1")
summary(cbind(est0, est1))
est <- sim_ame(s, var = "treat", by = ~married,
               contrast = "rd", verbose = FALSE)
est
summary(est)
# Average marginal effect of `age` within levels of
# married*race
est <- sim_ame(s, var = "age", by = ~married + race,
               verbose = FALSE)
est
summary(est, null = 0)
# Comparing AMEs between married and unmarried for
# each level of `race`
est_diff <- est[4:6] - est[1:3]
names(est_diff) <- paste0("AME_diff|", levels(lalonde$race))
summary(est_diff)
# Average adjusted predictions at a combination of `treat`
# and `married`
est <- sim_ame(s, var = c("treat", "married"),
               verbose = FALSE)
est
# Average marginal effect of `age` setting `married` to 1
est <- sim_ame(s, var = list("age", married = 1),
               verbose = FALSE)
Apply a function to simulated parameter values
Description
sim_apply() applies a function that produces quantities of interest to each set of simulated coefficients produced by sim(); these calculated quantities form the posterior sampling distribution for the quantities of interest. Capabilities are available for parallelization.
Usage
sim_apply(sim, FUN, verbose = TRUE, cl = NULL, ...)
## S3 method for class 'clarify_est'
print(x, digits = 4L, max.ests = 6L, ...)
Arguments
| sim | a  | 
| FUN | a function to be applied to each set of simulated coefficients. See Details. | 
| verbose | 
 | 
| cl | a cluster object created by  | 
| ... | for  | 
| x | a  | 
| digits | the minimum number of significant digits to be used; passed to  | 
| max.ests | the maximum number of estimates to display. | 
Details
sim_apply() applies a function, FUN, to each set of simulated coefficients, similar to apply(). This function should return a numeric vector containing one or more estimated quantities. This should be a named vector to more easily keep track of the meaning of each estimated quantity. Care should be taken to ensure that the returned vector is the same length each time FUN is called. NAs are allowed in the output but should be avoided if possible.
The arguments to FUN can be specified in a few ways. If FUN has an
argument called coefs, a simulated set of coefficients will be passed to
this argument, and FUN should compute and return a quantity based on the
coefficients (e.g., the difference between two coefficients if one wants to
test whether two coefficients are equal). If FUN has an argument called
fit, a model fit object of the same type as the one originally supplied
to sim() (e.g., an lm or glm object) will be passed to this argument,
where the coefficients of the fit object have been replaced by the
simulated coefficients generated by sim(), and FUN should compute and
return a quantity based on the model fit (e.g., a computation based on the
output of predict()). If neither coefs nor fit are the names of
arguments to FUN, the model fit object with replaced coefficients will be
supplied to the first argument of FUN.
When custom coefficients are supplied to sim(), i.e., when the coefs
argument to sim() is not left at its default value, FUN must accept a
coefs argument and a warning will be thrown if it accepts a fit
argument. This is because sim_apply() does not know how to reconstruct
the original fit object with the new coefficients inserted. The quantities
computed by sim_apply() must therefore be computed directly from the
coefficients.
If FUN is not supplied at all, the simulated values of the coefficients will be returned in the output with a warning. Set FUN to NULL or verbose to FALSE to suppress this warning.
sim_apply() with multiply imputed data
When using misim() and sim_apply() with multiply imputed data, the
coefficients are supplied to the model fit corresponding to the imputation
identifier associated with each set of coefficients, which means if FUN
uses a dataset extracted from a model (e.g., using insight::get_data()), it will do so from the model fit in
the corresponding imputation.
The original estimates (see Value below) are computed as the mean of the
estimates across the imputations using the original coefficients averaged
across imputations. That is, first, the coefficients estimated in the
models in the imputed datasets are combined to form a single set of pooled
coefficients; then, for each imputation, the quantities of interest are
computed using the pooled coefficients; finally, the mean of the resulting
estimates across the imputations are taken as the "original" estimates.
Note this procedure is only valid for quantities with symmetric sampling
distributions, which excludes quantities like risk ratios and odds ratios,
but includes log risk ratios and log odds ratios. The desired quantities
can be transformed from their log versions using
transform().
Value
A clarify_est object, which is a matrix with a column for each
estimated quantity and a row for each simulation. The original estimates
(FUN applied to the original coefficients or model fit object) are stored
in the attribute "original". The "sim_hash" attribute contains the
simulation hash produced by sim().
See Also
-  sim()for generating the simulated coefficients
-  summary.clarify_est()for computing p-values and confidence intervals for the estimated quantities
-  plot.clarify_est()for plotting estimated quantities and their simulated posterior sampling distribution.
Examples
data("lalonde", package = "MatchIt")
fit <- lm(re78 ~ treat + age + race + nodegree + re74,
          data = lalonde)
coef(fit)
set.seed(123)
s <- sim(fit, n = 500)
# Function to compare predicted values for two units
# using `fit` argument
sim_fun <- function(fit) {
  pred1 <- unname(predict(fit, newdata = lalonde[1,]))
  pred2 <- unname(predict(fit, newdata = lalonde[2,]))
  c(pred1 = pred1, pred2 = pred2)
}
est <- sim_apply(s, sim_fun, verbose = FALSE)
# Add difference between predicted values as
# additional quantity
est <- transform(est, `diff 1-2` = pred1 - pred2)
# Examine estimates and confidence intervals
summary(est)
# Function to compare coefficients using `coefs`
# argument
sim_fun <- function(coefs) {
  setNames(coefs["racewhite"] - coefs["racehispan"],
           "wh - his")
}
est <- sim_apply(s, sim_fun, verbose = FALSE)
# Examine estimates and confidence intervals
summary(est)
# Another way to do the above:
est <- sim_apply(s, FUN = NULL)
est <- transform(est,
                 `wh - his` = `racewhite` - `racehispan`)
summary(est, parm = "wh - his")
Compute predictions and first differences at set values
Description
sim_setx() is a wrapper for sim_apply() that computes predicted values of the outcome at specified values of the predictors, sometimes called marginal predictions. One can also compute the difference between two marginal predictions (the "first difference"). Although any function that accepted clarify_est objects can be used with sim_setx() output objects, a special plotting function, plot.clarify_setx(), can be used to plot marginal predictions.
Usage
sim_setx(
  sim,
  x = list(),
  x1 = list(),
  outcome = NULL,
  type = NULL,
  verbose = TRUE,
  cl = NULL,
  ...
)
## S3 method for class 'clarify_setx'
print(x, digits = 4L, max.ests = 6L, ...)
Arguments
| sim | a  | 
| x | a data.frame containing a reference grid of predictor values or a named list of values each predictor should take defining such a reference grid, e.g.,  For  | 
| x1 | a data.frame or named list of the value each predictor should take to compute the
first difference from the predictor combination specified in  | 
| outcome | a string containing the name of the outcome or outcome level for multivariate (multiple outcomes) or multi-category outcomes. Ignored for univariate (single outcome) and binary outcomes. | 
| type | a string containing the type of predicted values (e.g., the link or the response). Passed to  | 
| verbose | 
 | 
| cl | a cluster object created by  | 
| ... | for  | 
| digits | the minimum number of significant digits to be used; passed to  | 
| max.ests | the maximum number of estimates to display. | 
Details
When x is a named list of predictor values, they will be crossed
to form a reference grid for the marginal predictions. Any predictors not
set in x are assigned their "typical" value, which, for factor,
character, logical, and binary variables is the mode, for numeric variables
is the mean, and for ordered variables is the median. These values can be
seen in the "setx" attribute of the output object. If x is empty, a
prediction will be made at a point corresponding to the typical value of
every predictor. Estimates are identified (in summary(), etc.) only by
the variables that differ across predictions.
When x1 is supplied, the first difference is computed, which here is
considered as the difference between two marginal predictions. One marginal
prediction must be specified in x and another, ideally with a single
predictor changed, specified in x1.
Value
A clarify_setx object, which inherits from clarify_est and is similar to the output of sim_apply(), with the following additional attributes:
-  "setx"- a data frame containing the values at which predictions are to be made
-  "fd"- whether or not the first difference is to be computed; set toTRUEifx1is specified andFALSEotherwise
See Also
sim_apply(), which provides a general interface to computing any
quantities for simulation-based inference; plot.clarify_setx() for plotting the
output of a call to sim_setx(); summary.clarify_est() for computing
p-values and confidence intervals for the estimated quantities.
Examples
data("lalonde", package = "MatchIt")
fit <- lm(re78 ~ treat + age + educ + married + race + re74,
          data = lalonde)
# Simulate coefficients
set.seed(123)
s <- sim(fit, n = 100)
# Predicted values at specified values of values, typical
# values for other predictors
est <- sim_setx(s, x = list(treat = 0:1,
                            re74 = c(0, 10000)),
                verbose = FALSE)
summary(est)
plot(est)
# Predicted values at specified grid of values, typical
# values for other predictors
est <- sim_setx(s, x = list(age = c(20, 25, 30, 35),
                            married = 0:1),
                verbose = FALSE)
summary(est)
plot(est)
# First differences of treat at specified value of
# race, typical values for other predictors
est <- sim_setx(s, x = data.frame(treat = 0, race = "hispan"),
                x1 = data.frame(treat = 1, race = "hispan"),
                verbose = FALSE)
summary(est)
plot(est)
Transform and combine clarify_est objects
Description
transform() modifies a clarify_est object by allowing for the calculation of new quantities from the existing quantities without re-simulating them. cbind() binds two clarify_est objects together.
Usage
## S3 method for class 'clarify_est'
transform(`_data`, ...)
## S3 method for class 'clarify_est'
cbind(..., deparse.level = 1)
Arguments
| _data | the  | 
| ... | for  | 
| deparse.level | ignored. | 
Details
For transform(), the expression on the right side of the = should use the names of the existing quantities (e.g., `E[Y(1)]` - `E[Y(1)]`), with ` appropriately included when the quantity name include parentheses or brackets. Alternatively, it can use indexes prefixed by .b, e.g., .b2 - .b1, to refer to the corresponding quantity by position. This can aid in computing derived quantities of quantities with complicated names. (Note that if a quantity is named something like .b1, it will need to be referred to by position rather than name, as the position-based label takes precedence). See examples. Setting an existing value to NULL will remove that quantity from the object.
cbind() does not rename the quantities or check for uniqueness of the names, so it is important to rename them yourself prior to combining the objects.
Value
A clarify_est object, either with new columns added (when using transform()) or combining two clarify_est objects. Note that any type attributes corresponding to the sim_apply() wrapper used (e.g., sim_ame()) is lost when using either function. This can affect any helper functions (e.g., plot()) designed to work with the output of specific wrappers.
See Also
Examples
data("lalonde", package = "MatchIt")
# Fit the model
fit <- lm(re78 ~ treat * (age + educ + race +
             married + re74 + re75),
           data = lalonde)
# Simulate coefficients
set.seed(123)
s <- sim(fit, n = 100)
# Average adjusted predictions for `treat` within
# subsets of `race`
est_b <- sim_ame(s, var = "treat", verbose = FALSE,
                 subset = race == "black")
est_b
est_h <- sim_ame(s, var = "treat", verbose = FALSE,
                 subset = race == "hispan")
est_h
# Compute differences between adjusted predictions
est_b <- transform(est_b,
                   diff = `E[Y(1)]` - `E[Y(0)]`)
est_b
est_h <- transform(est_h,
                   diff = `E[Y(1)]` - `E[Y(0)]`)
est_h
# Bind estimates together after renaming
names(est_b) <- paste0(names(est_b), "_b")
names(est_h) <- paste0(names(est_h), "_h")
est <- cbind(est_b, est_h)
est
# Compute difference in race-specific differences
est <- transform(est,
                 `diff-diff` = .b6 - .b3)
summary(est,
        parm = c("diff_b", "diff_h", "diff-diff"))
# Remove last quantity by using `NULL`
transform(est, `diff-diff` = NULL)