Overview. Data transformations are a useful companion for parametric regression models. A well-chosen or learned transformation can greatly enhance the applicability of a given model, especially for data with irregular marginal features (e.g., multimodality, skewness) or various data domains (e.g., real-valued, positive, or compactly-supported data).
Given paired data \((x_i,y_i)\) for
\(i=1,\ldots,n\), SeBR
implements efficient and fully Bayesian inference for semiparametric
regression models that incorporate (1) an unknown data
transformation
\[ g(y_i) = z_i \]
and (2) a useful parametric regression model
\[ z_i = f_\theta(x_i) + \sigma \epsilon_i \]
with unknown parameters \(\theta\) and independent errors \(\epsilon_i\).
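To make the two-stage structure concrete, here is a minimal base-R sketch that simulates data from this model. It assumes a linear \(f_\theta(x) = x'\theta\), Gaussian errors, and \(g = \log\) (so that \(y = \exp(z)\) is positive). These choices and all variable names are illustrative assumptions, not anything prescribed by SeBR.

```r
# Illustrative simulation from y = g^{-1}(z), z = x'theta + sigma * eps.
# Assumptions (not from SeBR): g = log, linear f_theta, Gaussian errors.
set.seed(1)
n <- 200                                   # sample size
p <- 3                                     # number of columns (incl. intercept)
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # design matrix
theta <- c(0.5, 1, -1)                     # assumed "true" coefficients
sigma <- 0.5                               # assumed error scale

z <- as.numeric(X %*% theta + sigma * rnorm(n))  # latent Gaussian-scale data
g_inv <- exp                               # inverse of the assumed g = log
y <- g_inv(z)                              # observed, positive-valued responses

# If g were known, least squares on g(y) would approximately recover theta
theta_hat <- coef(lm(log(y) ~ X - 1))
```

In practice \(g\) is unknown, which is the gap the semiparametric approach is meant to fill.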
Examples. We focus on the following important special cases:
The linear model with Gaussian errors:
\[ z_i = x_i'\theta + \sigma\epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} N(0, 1) \]
The transformation \(g\) broadens the applicability of this useful class of models, including for positive or compactly-supported data.
Quantile regression, which replaces the Gaussian errors with an asymmetric Laplace distribution (ALD),
\[ z_i = x_i'\theta + \sigma\epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} ALD(\tau) \]
to target the \(\tau\)th quantile of \(z\) at \(x\), or equivalently, for a monotone increasing \(g\), the \(\tau\)th quantile of \(y\) at \(x\), which is \(g^{-1}\) applied to the \(\tau\)th quantile of \(z\). The ALD is often a poor model for real data, especially when \(\tau\) is near zero or one. The transformation \(g\) offers a pathway to improve the model adequacy while still targeting the desired quantile of the data.
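The link between quantiles of \(z\) and quantiles of \(y\) rests on the equivariance of quantiles under monotone increasing maps: \(Q_\tau(y) = g^{-1}(Q_\tau(z))\). The base-R check below illustrates this numerically, assuming \(g = \log\) purely for illustration.

```r
# Quantiles commute with monotone increasing maps:
# Q_tau(y) = g^{-1}(Q_tau(z)) when y = g^{-1}(z).
# Assumption for illustration only: g = log, so g^{-1} = exp.
set.seed(2)
tau <- 0.9
z <- rnorm(1001, mean = 1, sd = 0.5)   # draws on the transformed scale
y <- exp(z)                            # corresponding draws on the data scale

# type = 1 returns an order statistic (inverse empirical CDF),
# so the same observation is selected on both scales
q_z <- quantile(z, probs = tau, type = 1, names = FALSE)
q_y <- quantile(y, probs = tau, type = 1, names = FALSE)

equivariant <- isTRUE(all.equal(q_y, exp(q_z)))
```

Interpolating quantile types (such as R's default `type = 7`) satisfy the relation only approximately in finite samples.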
The Gaussian process (GP) model, which replaces the linear predictor with a nonparametric regression function:
\[ z_i = f_\theta(x_i) + \sigma \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} N(0, 1) \]
where \(f_\theta\) is a GP and \(\theta\) parameterizes the mean and covariance functions. Although GPs offer substantial flexibility for the regression function \(f_\theta\), this model may be inadequate when \(y\) has irregular marginal features or a restricted domain (e.g., positive or compact).
Challenges: The goal is to provide fully Bayesian posterior inference for the unknowns \((g, \theta)\) and posterior predictive inference for future/unobserved data \(\tilde y(x)\). We prefer a model and algorithm that offer both (i) flexible modeling of \(g\) and (ii) efficient posterior and predictive computations.
Innovations: Our approach (https://doi.org/10.1080/01621459.2024.2395586) specifies a nonparametric model for \(g\), yet also provides Monte Carlo (not MCMC) sampling for the posterior and predictive distributions. As a result, we control the approximation accuracy via the number of simulations, but do not require the lengthy runs, burn-in periods, convergence diagnostics, or inefficiency factors that accompany MCMC. The Monte Carlo sampling is typically quite fast.
SeBR. The package SeBR is installed and loaded as follows:
# CRAN version:
# install.packages("SeBR")
# Development version: 
# devtools::install_github("drkowal/SeBR")
library(SeBR)

The main functions in SeBR are:
sblm(): Monte Carlo sampling for posterior and predictive inference with the semiparametric Bayesian linear model;
sbsm(): Monte Carlo sampling for posterior and predictive inference with the semiparametric Bayesian spline model, which replaces the linear model with a spline for nonlinear modeling of \(x \in \mathbb{R}\);
sbqr(): blocked Gibbs sampling for posterior and predictive inference with the semiparametric Bayesian quantile regression; and
sbgp(): Monte Carlo sampling for predictive inference with the semiparametric Bayesian Gaussian process model.
Each function returns a point estimate of \(\theta\) (coefficients), point predictions at some specified testing points (fitted.values), posterior samples of the transformation \(g\) (post_g), and posterior predictive samples of \(\tilde y(x)\) at the testing points (post_ypred), as well as other function-specific quantities (e.g., posterior draws of \(\theta\), post_theta). The calls coef() and fitted() extract the point estimates and point predictions, respectively.
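As a hedged illustration of how these outputs might be used, the sketch below fits sblm() to simulated data and summarizes the predictive draws. The argument names (y, X, X_test) and the layout of post_ypred (one column per testing point) reflect our reading of the package documentation and should be confirmed with ?sblm. The simulated data and the names X_test and pi_y are assumptions for this example only. The code runs only if SeBR is installed.

```r
# Hedged usage sketch; skipped entirely when SeBR is not installed.
if (requireNamespace("SeBR", quietly = TRUE)) {
  set.seed(123)
  n <- 100
  X <- matrix(rnorm(n * 2), n, 2)            # simulated covariates (assumption)
  y <- exp(as.numeric(X %*% c(1, -1)) + 0.5 * rnorm(n))  # positive responses
  X_test <- matrix(rnorm(20 * 2), 20, 2)     # hypothetical testing points

  # Argument names assumed from the package docs; check ?sblm
  fit <- SeBR::sblm(y = y, X = X, X_test = X_test)

  theta_hat <- coef(fit)     # point estimate of theta
  y_hat <- fitted(fit)       # point predictions at the testing points

  # 90% posterior predictive intervals, one row per testing point
  # (assumes post_ypred has one column per testing point)
  pi_y <- t(apply(fit$post_ypred, 2, quantile, probs = c(0.05, 0.95)))
}
```

Because the draws are Monte Carlo rather than MCMC, summaries such as these intervals can be computed directly from the samples without discarding a burn-in.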
Note: The package also includes Box-Cox variants of these functions, which restrict \(g\) to the (signed) Box-Cox parametric family \(g(t; \lambda) = \{\mbox{sign}(t) \vert t \vert^\lambda - 1\}/\lambda\) with known or unknown \(\lambda\). The parametric transformation is less flexible, especially for irregular marginals or restricted domains, and requires MCMC sampling. These functions (e.g., blm_bc()) are primarily for benchmarking.
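For reference, the signed Box-Cox formula above can be written directly in base R. The helper names below are hypothetical and are not SeBR functions. The sketch covers only \(\lambda \neq 0\), and the round-trip check uses \(\lambda = 0.5\).

```r
# Signed Box-Cox transformation and its inverse (illustrative helpers;
# these names are hypothetical and not part of SeBR).
signed_box_cox <- function(t, lambda) {
  stopifnot(lambda != 0)   # the lambda = 0 limit is not handled in this sketch
  (sign(t) * abs(t)^lambda - 1) / lambda
}

signed_box_cox_inv <- function(z, lambda) {
  stopifnot(lambda != 0)
  s <- lambda * z + 1      # equals sign(t) * |t|^lambda
  sign(s) * abs(s)^(1 / lambda)
}

# Round-trip check on a grid that includes negative values
t_grid <- c(-4, -1, 0.25, 1, 9)
z <- signed_box_cox(t_grid, lambda = 0.5)
round_trip_ok <- isTRUE(all.equal(signed_box_cox_inv(z, lambda = 0.5), t_grid))
```

A single parameter \(\lambda\) controls the whole curve, which is why this family cannot adapt to features such as multimodality.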
Detailed documentation and examples are available at https://drkowal.github.io/SeBR/.
Kowal, D. and Wu, B. (2024). Monte Carlo inference for semiparametric Bayesian regression. JASA. https://doi.org/10.1080/01621459.2024.2395586