Abstract
FLASHMM is a package for single-cell differential expression analysis using linear mixed-effects models (LMMs). Mixed-effects models have become a powerful tool in single-cell studies due to their ability to model intra-subject correlation and inter-subject variability. However, classic LMM estimation methods face limitations in computational speed and memory usage when applied to large-scale single-cell transcriptomics data. The FLASHMM package provides a fast and scalable approach to address scalability and memory efficiency in LMM fitting. This vignette describes the methods for LMM estimation and hypothesis testing, as well as the R functions for implementing these methods, and demonstrates the use of the package through an example.Consider the linear mixed-effects model (LMM) as expressed below (Searle, Casella, and McCulloch 2006) \[\begin{equation}\label{eq:lmm} y = X\beta + Zb + \epsilon, \end{equation}\] where \(y\) is an \(n\times 1\) vector of observed responses, \(X\) is an \(n\times p\) design matrix for fixed effects \(\beta\), \(Z\) is an \(n\times q\) design matrix for random effects \(b\), and \(\epsilon\) is a vector of residual errors. The term of random effects may be a combination of various components, \[ Zb = Z_1 b_1 + \cdots + Z_K b_K, \] where \(Z=[Z_1,\ldots,Z_K]\), \(b=[b^T_1,\ldots,b^T_K]^T\), \(K\) is the number of random-effect components, and \(Z_k\) is an \(n\times q_k\) design matrix for the \(k\)-th random-effect component. The superscript \(T\) denotes a transpose of vector or matrix. The basic assumptions are as follows:
Here \(\mathbf{0}\) is a vector of zero elements, \(I_n\) is an \(n\times n\) identity matrix, and \(\sigma^2_k\) and \(\sigma^2\) are unknown parameters, called variance components. Assumption (1) implies \(p < n\). We also assume \(q_k < n\). If \(q_k\geq n\), we can use principal component analysis (PCA) to obtain an equivalent LMM with the number of random effects less than \(n\).
Hartley and Rao (1967) developed the maximum likelihood (ML) method for estimating the LMM parameters (fixed effects and variance components). Patterson and Thompson (1971) proposed a modified maximum likelihood procedure, known as restricted maximum likelihood (REML) method. Estimating the variance components by either ML or REML is a numerical optimization problem. The log-likelihood based gradient methods have been proposed (Searle, Casella, and McCulloch 2006).
We developed a fast and scalable algorithm for implementing the gradient methods based on summary statistics. Let \(XX\), \(XY\), \(ZX\), \(ZY\), and \(ZZ\) denote a matrix, respectively, which define the summary statistics that are computed from cell-level data \(X\), \(Y\) and \(Z\) as follows: \[\begin{equation} \label{eq:sdata} \begin{array}{l} XX = X^TX, ~XY = X^TY^T,\\ ZX = Z^TX, ~ZY = Z^TY^T, ~ZZ = Z^TZ,\\ Ynorm = [y_1y_1^T, \ldots, y_my_m^T], \end{array} \end{equation}\] where \(Y = [y_1^T, \ldots, y_m^T]^T\) is a \(m\)-by-\(n\) matrix of gene expression profile with each row \(y_i\) corresponding to the expression of gene \(i\), \(i=1,\ldots,m\), \(m\) is the number of genes and \(n\) is the number of cells (sample size). After the summary statistics are precomputed, the summary statistics based algorithm has a complexity of \(O(m(p^3 + q^3))\), which doesn’t depend on the number of cells (sample size \(n\)). In single-cell differential expression (DE) analysis, the numbers of fixed and random effects, \(p\) and \(q\), are relatively small. Therefore, the algorithm is fast and scalable, and requires less computer memory. See the Supplementary Material in Xu et al. (2025) for details.
The FLASHMM package provides two functions for fitting LMMs: lmm and lmmfit. The lmm function takes summary statistics as input, whereas lmmfit is a wrapper around lmm that directly processes cell-level data and computes the summary statistics internally. While lmmfit is easier to use, it has the limitation of higher memory consumption. For extremely large scale data, we can precompute the summary-level data by \(\eqref{eq:sdata}\), and then use lmm function to fit LMMs. FLASHMM provides lmmtest function to perform statistical test on the fixed effects and the contrasts of the fixed effects.
In summary, FLASHMM package provides the following main functions.
We use a simulated multi-sample multi-cell-type single-cell RNA-seq (scRNA-seq) dataset to illustrate how to utilize FLASHMM to perform single-cell differential expression analysis. In this example, we are interested in identifying the genes differentially expressed between two treatments (conditions) within a cell type using a linear mixed-effects model (LMM).
We simulate the multi-sample multi-cell-type scRNA-seq data by simuRNAseq function in FLASHMM package. The simuRNAseq function generates scRNA-seq data with or without a reference dataset. The reference dataset consists of count data and metadata. The count data is a genes-by-cells matrix. The metadata is a data frame containing three columns: samples (subjects), cell-types (clusters), and treatments (experimental conditions), and the three columns must be named ‘sam’, ‘cls’, and ‘trt’, respectively.
The simulated expression count is generated by the negative binomial (NB) distribution with dispersion \(\phi_g\) and mean \(\mu_{g,ijk}\), \[\begin{equation}\label{nbinomsimu} y_{g, ijk}\sim NB(\mu_{g,ijk}, \phi_g), \end{equation}\] \[\log(\mu_{g,ijk}) = \log(\mu_g) + \alpha_{ijk} + \beta_{g, ij} + b_{g,k},\] where \(y_{g, ijk}\) be the count for gene \(g\) and the cell from subject \(k = 1,\ldots, n_s\) and cell-type \(j = 1, \ldots, n_c\) with treatment \(i=1,2\). \(\phi_g\) and \(\mu_g\) are the dispersion and mean estimated from the reference data for gene \(g\), \(\alpha_{ijk}\) is a centered log-library-size for the cell \(c_{ijk}\), \(\beta_{g,ij}\) represents the effect of treatment \(i\) specific to cell-type \(j\), \(b_{g,k} \sim N(0, \sigma^2_b)\) is a random effect of subject variation. For the non-DE genes, \(\beta_{g,1j}=\beta_{g,2j}=0\). For the DE genes, \(\beta_{g,1j} = 0\) and \(\beta_{g, 2j}\sim \pm Uniform(a_1, a_2)\), a uniform distribution with \(a_1a_2>0\), where \(a_1\) and \(a_2\) are the lower and upper bounds of the DE gene effect sizes. See the Supplementary Material in Xu et al. (2025) for details.
For simplicity and demonstration purposes, we use a simulated reference dataset to generate the scRNA-seq data. First we simulate the reference dataset.
##Generate a reference dataset by simuRNAseq function.
set.seed(2502)
refdata <- simuRNAseq(nGenes = 100, nCells = 10000)
counts <- refdata$counts #counts
metadata <- refdata$metadata #metadata
head(metadata)
#>       sam cls trt libsize
#> Cell1  B5   1   B     197
#> Cell2  A3   1   A     182
#> Cell3  A5   9   A     196
#> Cell4  B1   4   B     200
#> Cell5  B5   1   B     169
#> Cell6  A7   1   A     216
rm(refdata)For the simulated reference dataset, we don’t need to change the metadata column names because the columns of samples, clusters, and treatments, are already named ‘sam’, ‘cls’, and ‘trt’, respectively. For a real biological reference dataset, the column names of the metadata should be correspondingly changed as ‘sam’, ‘cls’ and ‘trt’.
Next we use the reference counts and metadata to simulate scRNA-seq data that contains 100 genes and 100,000 cells from 25 samples (subjects) and 10 clusters (cell-types) with 2 treatments. There are totally 10 DE genes specific to a cell-type.
##Generate the scRNA-seq data by simuRNAseq function.
set.seed(2503)
dat <- simuRNAseq(counts, metadata = metadata, nGenes = 100, nCells = 100000, 
       nsam = 25, ncls = 10, ntrt = 2, nDEgenes = 10)
str(dat)
#> List of 5
#>  $ ref.mean.dispersion:'data.frame': 100 obs. of  2 variables:
#>   ..$ mu        : num [1:100] 3.12 2.97 1.11 4.19 1.87 ...
#>   ..$ dispersion: num [1:100] 0.766 0.836 1.602 2.37 1.855 ...
#>  $ metadata           :'data.frame': 100000 obs. of  4 variables:
#>   ..$ sam    : chr [1:100000] "A13" "B6" "A9" "B9" ...
#>   ..$ cls    : chr [1:100000] "2" "8" "3" "4" ...
#>   ..$ trt    : chr [1:100000] "A" "B" "A" "B" ...
#>   ..$ libsize: num [1:100000] 177 238 145 170 209 193 143 172 227 187 ...
#>  $ counts             : num [1:100, 1:100000] 5 4 0 2 3 9 0 0 0 1 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:100] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>   .. ..$ : chr [1:100000] "Cell9476" "Cell774" "Cell5736" "Cell6735" ...
#>  $ DEgenes            :'data.frame': 10 obs. of  3 variables:
#>   ..$ gene   : chr [1:10] "Gene91" "Gene92" "Gene93" "Gene94" ...
#>   ..$ beta   : num [1:10] 0.649 -0.87 -0.752 0.849 -0.499 ...
#>   ..$ cluster: chr [1:10] "1" "10" "10" "2" ...
#>  $ treatment          : chr "B"
##Samples (subjects) nested in one treatment A or B
#table(dat$metadata$sam)
#table(dat$metadata$trt)
all(grep("A", dat$metadata$sam) %in% which(dat$metadata$trt == "A"))
#> [1] TRUE
all(grep("B", dat$metadata$sam) %in% which(dat$metadata$trt == "B"))
#> [1] TRUE
rm(counts, metadata) #releasing memoryThe simulated data contains
We perform differential expression (DE) analysis of the simulated data using FLASHMM package. We are to identify the significantly differentially expressed genes between two treatments in a cell-type based on the LMM. The gene expression profile is taken as log-transformed count matrix, \[\begin{equation}\label{exprs} Y = \log(1+\text{counts}), \end{equation}\] in which each row represents the expression profile for a gene. The analyses involve following steps: LMM design, LMM fitting, and hypothesis testing.
LMM provides a framework to represent the single-cell gene expression profile by incorporating fixed effects, which capture systematic differences across experimental conditions, and random effects, which model the correlations within subjects and the variations between subjects. Constructing design matrices for fixed and random effects is an important step in LMM-based DE analysis.
Design matrix for fixed effects: As a general linear model design, the fixed effect design matrix can be created by model.matrix function as follows \[\begin{equation}\label{fixedeff} X = model.matrix(\sim 0 + log(library.size) + covariates + cell.type + cell.type:treatment), \end{equation}\] where
Design matrix for random effects: In multi-subject single-cell studies, cells are nested within subjects, and subjects are often nested within experimental conditions, meaning that cells from the same subject are correlated, and subjects within the same condition share common sources of variation. To account for this hierarchical structure, the mixed models treat the subjects (samples) as random effects, efficiently capturing both within-subject correlation and between-subject variability. In many scRNA-seq studies, samples are perfectly confounded with experimental conditions. In such cases, including subjects as a fixed effect may introduce collinearity, especially when the number of subjects is large.
The design matrix for the single-component random effects can be created by \[\begin{equation}\label{randomeff1} Z = model.matrix(\sim 0 + subject) \end{equation}\] with a dimension \(d = ncol(Z)\). If appropriate, for example, we also take account of the measurement time as a random effect within a subject, we may consider the two-component random effects with design matrix given by \[\begin{equation}\label{randomeff2} Z = model.matrix(\sim 0 + subject + subject:time) \end{equation}\] with dimensions \(d = c(ncol(Z)/2, ncol(Z)/2)\), a 2-vector, where \(ncol(Z)/2\) equals to the number of subjects.
## Gene expression matrix, Y = log2(1 + counts)
Y <- log2(1 + dat$counts)
dat$counts <- NULL  #releasing memory
## Since the simulated data contains no covariates other than libsize (library size) and
## cls (clsters or cell types), the design matrix for fixed effects is given as follows.
X <- model.matrix(~0 + log(libsize) + cls + cls:trt, data = dat$metadata)
## Note that the samples in the simulated data are nested in one treatment A or B. Design
## matrix for single-component random effects is given by
Z <- model.matrix(~0 + as.factor(sam), data = dat$metadata)
d <- ncol(Z)
## Suppose the data contains different measurement time points within a sample, denoted
## as 'time', which are randomly generated. To account for the variability of the time
## within a sample, we may use the two-component random effects design matrix:
set.seed(250801)
n <- nrow(X)
dat$metadata$time <- runif(n, 1, 1.5) * sample(1:2, n, replace = TRUE)
Za <- model.matrix(~0 + sam + sam:time, data = dat$metadata)
da <- c(ncol(Za)/2, ncol(Za)/2)  #dimension
range(Za[, 1:d] - Z)  #identical to the single-component design
#> [1] 0 0
rm(dat)  #releasing memoryWe use either lmm or lmmfit function to fit LMMs. With the cell-level data matrices \(Y\), \(X\) and \(Z\), the LMMs can be fit by \[lmmfit(Y, X, Z, d = d),\] where \(d\) is a vector of the numbers of random-effects in different components. For \(K\) components of random-effects, \(d = (q_1, \ldots, q_K)\), where \(q_k\) is the number of columns of the design matrix \(Z_k\) for the \(k\)-th random-effect component.
The lmmfit function has high memory usage. For extremely large-scale data, it is recommended to pre-compute and store the summary-level data: \(XX\), \(XY\), \(ZX\), \(ZY\), \(ZZ\), and \(Ynorm\), defined in \(\eqref{eq:sdata}\), and then use the lmm function to fit the LMMs as follows: \[lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d).\] The summary-level data doesn’t depend on the sample size \(n\). This makes lmm memory-efficient. By default, the lmm and lmmfit functions use the restricted maximum likelihood (REML) method (method = ‘REML’). To fit the LMM using maximum likelihood (ML), set method = ‘ML’.
##Option 1: Fit LMM based on cell-level data.
max.iter <- 100
epsilon <- 1e-5
##Fit the LMM with one-component random effects.
fit1 <- lmmfit(Y, X, Z, d = d, max.iter = max.iter, epsilon = epsilon)
##Fit the LMM with two-component random effects.
fit2 <- lmmfit(Y, X, Za, d = da, max.iter = max.iter, epsilon = epsilon)
##Option 2: Fit LMM based on summary-level data.
##(1) Compute the summary-level data.
n <- nrow(X)
XX <- t(X)%*%X
XY <- t(Y%*%X)
ZX <- t(Z)%*%X
ZY <- t(Y%*%Z)
ZZ <- t(Z)%*%Z
Ynorm <- rowSums(Y*Y)
##The summary-level data can also be computed by sslmm function:
ss <- sslmm(X, Y, Z)
XX <- ss$XX; XY <- ss$XY
ZX <- ss$ZX; ZY <- ss$ZY; ZZ <- ss$ZZ
n <- ss$n; Ynorm <- ss$Ynorm
rm(X, Y, Z, Za) #releasing memory.
##(2) Fit LMM using lmm function.
fitss <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d, 
             max.iter = max.iter, epsilon = epsilon)
##The two LMM fits are identical.
identical(fit1, fitss) 
#> [1] TRUE
rm(fitss)
str(fit1)
#> List of 13
#>  $ method  : chr "REML"
#>  $ dlogL   : num [1:2, 1:100] 5.63e-10 -1.82e-11 9.78e-07 -4.80e-08 7.37e-08 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:100] "dl" "dl" "dl" "dl" ...
#>  $ logLik  : num [1:100] -163060 -161757 -120461 -140372 -132523 ...
#>  $ niter   : num [1:100] 4 4 4 4 4 4 4 4 4 4 ...
#>  $ coef    : num [1:21, 1:100] 0.837 -2.997 -2.995 -2.995 -2.977 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:21] "log(libsize)" "cls1" "cls10" "cls2" ...
#>   .. ..$ : chr [1:100] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>  $ se      : num [1:21, 1:100] 0.0239 0.1415 0.1418 0.1415 0.1416 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:21] "log(libsize)" "cls1" "cls10" "cls2" ...
#>   .. ..$ : chr [1:100] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>  $ t       : num [1:21, 1:100] 35 -21.2 -21.1 -21.2 -21 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:21] "log(libsize)" "cls1" "cls10" "cls2" ...
#>   .. ..$ : chr [1:100] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>  $ p       : num [1:21, 1:100] 7.05e-267 2.54e-99 9.14e-99 3.56e-99 5.69e-98 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:21] "log(libsize)" "cls1" "cls10" "cls2" ...
#>   .. ..$ : chr [1:100] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>  $ cov     : num [1:21, 1:21, 1:100] 0.000572 -0.003025 -0.003033 -0.003025 -0.003027 ...
#>   ..- attr(*, "dimnames")=List of 3
#>   .. ..$ : chr [1:21] "log(libsize)" "cls1" "cls10" "cls2" ...
#>   .. ..$ : chr [1:21] "log(libsize)" "cls1" "cls10" "cls2" ...
#>   .. ..$ : chr [1:100] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>  $ df      : int 99979
#>  $ theta   : num [1:2, 1:100] 0.0483 1.5264 0.073 1.4869 0.0374 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:2] "var1" "var0"
#>   .. ..$ : chr [1:100] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>  $ se.theta: num [1:2, 1:100] 0.01435 0.00683 0.02164 0.00665 0.01106 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:2] "var1" "var0"
#>   .. ..$ : chr [1:100] "Gene1" "Gene2" "Gene3" "Gene4" ...
#>  $ RE      : NULLThe outputs from LMM fitting contain:
Convergence of LMM fitting: The LMM fitting is considered converged if the absolute values of first partial derivatives of the log-likelihood are smaller than the convergence tolerance (epsilon), specified in the lmm and lmmfit functions; otherwise, the fitting does not converge. The genes for which LMM fitting doesn’t converge should be excluded in the subsequent analysis because the estimated coefficients for these genes are not reliable.
Testing of fixed effects using the lmmtest function
Although lmm and lmmfit functions return t-values and p-values for testing the fixed effects (coefficients), we can also use lmmtest function to conduct statistical tests on the fixed effects and their contrasts for various comparisons between different levels.
##Testing coefficients (fixed effects)
test <- lmmtest(fit1)
##The t-value and p-values are identical with those provided in the LMM fit.
range(test - cbind(t(fit1$coef), t(fit1$t), t(fit1$p)))
#> [1] 0 0
fit1$coef[, 1:4]
#>                    Gene1        Gene2        Gene3      Gene4
#> log(libsize)  0.83712370  0.815346582  0.609785453  1.0811460
#> cls1         -2.99711303 -2.770377841 -2.443269081 -3.7210080
#> cls10        -2.99477580 -2.770040504 -2.429391852 -3.7073470
#> cls2         -2.99466685 -2.743763947 -2.435933647 -3.7256140
#> cls3         -2.97706542 -2.754190732 -2.432881943 -3.7190633
#> cls4         -2.97413935 -2.755044022 -2.455657052 -3.7082038
#> cls5         -2.99039453 -2.762545257 -2.427748585 -3.7123421
#> cls6         -2.98074999 -2.747882597 -2.438762777 -3.7219514
#> cls7         -2.98944217 -2.793347458 -2.446139670 -3.7233747
#> cls8         -2.94949402 -2.728670124 -2.424298237 -3.7448112
#> cls9         -2.97601511 -2.728735173 -2.445063375 -3.7358832
#> cls1:trtB     0.11700126 -0.017459519  0.005427563 -0.1726133
#> cls10:trtB    0.13519741 -0.018591565  0.025108329 -0.1701416
#> cls2:trtB     0.14652033 -0.032295910 -0.006637487 -0.1376565
#> cls3:trtB     0.12578238 -0.057134984 -0.012688736 -0.1467455
#> cls4:trtB     0.09257895 -0.036546217  0.040026452 -0.1472434
#> cls5:trtB     0.13726104 -0.051430048  0.010164625 -0.1592534
#> cls6:trtB     0.09969125 -0.029924922  0.007534469 -0.1434095
#> cls7:trtB     0.13224193  0.004360562  0.028405146 -0.1716329
#> cls8:trtB     0.10008302 -0.069260573 -0.004854496 -0.1530307
#> cls9:trtB     0.13244395 -0.073164417  0.032615275 -0.1311506
#fit1$t[, 1:4]
#fit1$p[, 1:4]Differentially expressed (DE) genes: The coefficients of the interaction term cls\(\ast\):trtB represent the effects of treatment B versus A in a cell-type (cls\(\ast\)). The DE genes between two treatments in a cell-type can be identified based on the false discovery rate (FDR) or family-wise error rate (Bonferroni correction). Additionally, we may exclude genes with small coefficients (effect size or log-fold change).
##The DE genes specific to a cell-type
##Coefficients, t-values, and p-values for the genes specific to a cell-type.
index <- grep(":", rownames(fit1$coef))
ce <- fit1$coef[index, ]
tv <- fit1$t[index, ]
pv <- fit1$p[index, ]
out <- data.frame(
    gene = rep(colnames(ce), nrow(ce)), 
    cluster = rep(rownames(ce), each = ncol(ce)),
    coef = c(t(ce)), t = c(t(tv)), p = c(t(pv)))
##FDR.
out$FDR <- p.adjust(out$p, method = "fdr")
##The DE genes with FDR < 0.05 and abs(logFC) > 0.5
out <- out[order(out$p), ]
rownames(out) <- NULL
out[(out$FDR < 0.05) & (abs(out$coef) > 0.5) , ]
#>     gene    cluster       coef         t            p          FDR
#> 1 Gene94  cls2:trtB  0.9623800  9.388327 6.218300e-21 6.218300e-18
#> 2 Gene97  cls3:trtB -0.7280366 -8.058266 7.822917e-16 3.911459e-13
#> 3 Gene92 cls10:trtB -0.7872053 -7.552437 4.307974e-14 1.435991e-11
#> 4 Gene96  cls3:trtB  0.7869040  7.170842 7.505135e-13 1.876284e-10
#> 5 Gene98  cls5:trtB -0.7950199 -6.530485 6.586976e-11 1.317395e-08
#> 6 Gene99  cls6:trtB  0.8171681  6.479861 9.223164e-11 1.537194e-08Using contrast: We can make a comparison of the treatment B versus A across cell-types via the contrast that sums the treatment effects across cell-types as follows.
##Construct the contrast.
contrast <- cbind("BvsA" = numeric(nrow(fit1$coef)))
index <- grep(":", rownames(fit1$coef))
contrast[index, ] <- 1/length(index)
length(index)
#> [1] 10
##Test the contrast.
test <- lmmtest(fit1, contrast = contrast)
head(test)
#>         BvsA_coef     BvsA_t    BvsA_p
#> Gene1  0.12188015  1.3803539 0.1674808
#> Gene2 -0.03814476 -0.3517828 0.7250019
#> Gene3  0.01251011  0.1613397 0.8718262
#> Gene4 -0.15328775 -1.1955128 0.2318896
#> Gene5  0.11304765  1.1300582 0.2584544
#> Gene6 -0.01584361 -0.1642750 0.8695150The contrast can also be constructed by contrast.matrix function as follows.
ncls <- 10 #length(index)
sumeff <- paste0(paste0("cls", 1:ncls, ":trtB"), collapse = "+")
sumeff <- paste0("(", sumeff, ")/", ncls)
#sumeff
contrast <- contrast.matrix(
            contrast = c(BvsA = sumeff), 
            model.matrix.names = rownames(fit1$coef))
test <- lmmtest(fit1, contrast = contrast)
head(test)
#>         BvsA_coef     BvsA_t    BvsA_p
#> Gene1  0.12188015  1.3803539 0.1674808
#> Gene2 -0.03814476 -0.3517828 0.7250019
#> Gene3  0.01251011  0.1613397 0.8718262
#> Gene4 -0.15328775 -1.1955128 0.2318896
#> Gene5  0.11304765  1.1300582 0.2584544
#> Gene6 -0.01584361 -0.1642750 0.8695150Testing of variance component in the LMM with single-component random effects
Since the simulated data is generated by the LMM with single-component random effects, see Simulating scRNA-seq data Section, the variance component of random effects should not be zero. Next we use z-statistic \(\eqref{zvarcomp}\) to test the variance component. Note that ‘fit1’ is the output of fitting the LMM with single-component random effects using lmmfit function, see LMM fitting Section.
##Using z-test for testing variance components
i <- grep("var1", rownames(fit1$theta))  #The (first) variance component
z <- fit1$theta[i, ]/fit1$se.theta[i, ]   #z-statistics
##One-sided z-test p-values for hypotheses:
##H0: theta <= 0 vs H1: theta > 0
p <- pnorm(z, lower.tail = FALSE)
##Number of significant p-values
sum(p < 0.05/length(p)) #Bonferroni correction
#> [1] 100Testing of variance component in the LMM with two-component random effects
As described in LMM design Section, the LMM with two-component random effects has two variance components, the first one for random effects of subjects and the second one for random effects of measurement times within a subject. Since the simulated data was generated by the LMM with one-component random effects (i.e., subjects), the second variance component should be zero. Next we use both z-statistic \(\eqref{zvarcomp}\) and LRT statistic \(\eqref{lrtvarcomp}\) to test the second variance component of the random effects. Note that ‘fit2’ is the output of fitting the LMM with two-component random effects using lmmfit function, see LMM fitting Section.
##(1) z-test for testing the second variance component
##Z-statistics for the second variance component
i <- grep("var2", rownames(fit2$theta)) 
z <- fit2$theta[i, ]/fit2$se.theta[i, ] 
##One-sided z-test p-values for hypotheses:
##H0: theta <= 0 vs H1: theta > 0
p <- pnorm(z, lower.tail = FALSE)
##number of significant p-values
sum(p.adjust(p, method = "fdr") < 0.05)
#> [1] 0
sum(p < 0.05/length(p)) #Bonferroni correction
#> [1] 0
##(2) LRT for testing the second variance component
LRT <- 2*(fit2$logLik - fit1$logLik)
pLRT <- pchisq(LRT, df = 1, lower.tail = FALSE)
##number of significant p-values
sum(p.adjust(pLRT, method = "fdr") < 0.05)
#> [1] 0
sum(pLRT < 0.05/length(pLRT)) #Bonferroni correction
#> [1] 0
##QQ-plot
qqplot(runif(length(p)), p, xlab = "Uniform quantile", ylab = "Z-test p-value", col = "blue")
abline(0, 1, col = "gray")qqplot(runif(length(pLRT)), pLRT, xlab = "Uniform quantile", ylab = "LRT p-value", col = "blue")
abline(0, 1, col = "gray")To use the maximum likelihood (ML) method to fit the LMM, set method = ‘ML’ in the lmm and lmmfit functions.
##Fitting LMM using ML method
#fit3 <- lmmfit(Y, X, Z, d = d, method = "ML", max.iter = max.iter, epsilon = epsilon)
fit3 <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d, method = "ML",
             max.iter = max.iter, epsilon = epsilon)
##The DE genes specific to a cell-type
##Coefficients, t-values, and p-values
index <- grep(":", rownames(fit3$coef))
ce <- fit3$coef[index, ]
tv <- fit3$t[index, ]
pv <- fit3$p[index, ]
out <- data.frame(
    gene = rep(colnames(ce), nrow(ce)), 
    cluster = rep(rownames(ce), each = ncol(ce)),
    coef = c(t(ce)), t = c(t(tv)), p = c(t(pv)))
##The DE genes with FDR < 0.05 and abs(logFC) > 0.5
out$FDR <- p.adjust(out$p, method = "fdr") #FDR
out <- out[order(out$p), ]
rownames(out) <- NULL
out[(out$FDR < 0.05) & (abs(out$coef) > 0.5) , ]
#>     gene    cluster       coef         t            p          FDR
#> 1 Gene94  cls2:trtB  0.9623788  9.772953 1.505634e-22 1.505634e-19
#> 2 Gene97  cls3:trtB -0.7280336 -8.373840 5.647997e-17 2.823999e-14
#> 3 Gene92 cls10:trtB -0.7872121 -7.855767 4.012553e-15 1.337518e-12
#> 4 Gene96  cls3:trtB  0.7869033  7.467752 8.223598e-14 2.055899e-11
#> 5 Gene98  cls5:trtB -0.7950228 -6.801863 1.038529e-11 2.077058e-09
#> 6 Gene99  cls6:trtB  0.8171705  6.749672 1.489838e-11 2.483064e-09Hartley and Rao (1967) developed the maximum likelihood (ML) method for estimating the LMM parameters (fixed effects and variance components). Patterson and Thompson (1971) proposed a modified maximum likelihood procedure, known as restricted maximum likelihood (REML), which partitions the data into two mutually uncorrelated parts, one being free of the fixed effects used for estimating variance components. The REML estimator is unbiased, whereas the ML estimator of variance components is generally biased. With variance components, \(\theta\), estimated, the fixed effects estimated by either ML or REML are given as follows \[ \hat\beta = (X^TV_{\theta}^{-1}X )^{-1}X^TV_{\theta}^{-1}y, \] with covariance matrix \[var(\hat\beta) = (X^TV_{\theta}^{-1}X)^{-1},\] where \(\theta = (\theta_0, \theta_1,\ldots, \theta_K)\), \(\theta_0 = \sigma^2\), \(\theta_k = \sigma^2_k\), and \[V_{\theta} = \theta_0 I_n + \theta_1 Z_1Z_1^T + \ldots + \theta_K Z_KZ_K^T.\]
Estimating the variance components by either ML or REML is a numerical optimization problem. Various iterative methods based on log likelihood, called gradient methods, have been proposed (Searle, Casella, and McCulloch 2006). The gradient methods are represented by the iteration equation \[\begin{equation}\label{eq:gradient} \theta^{(i+1)} = \theta^{(i)} + \Gamma(\theta^{(i)})\frac{\partial l(\theta^{(i)})}{\partial\theta}, \end{equation}\] where \(\partial l(\theta)/\partial\theta\) is the gradient of the log likelihood function, and \(\Gamma(\theta)\) is a modifier matrix of the gradient direction, which can be specified by Newton–Raphson, Fisher scoring or average information. The Newton-Raphson method uses the inverse of the negative Hessian matrix (observed Fisher information) as a modifier, while the Fisher scoring method uses the inverse of Fisher information matrix (the expectation of negative Hessian matrix). The average information method uses the inverse of the average of the negative Hessian matrix and its expectation. See the Supplementary Material in Xu et al. (2025) for details. The Fisher scoring method is more stable than others because the negative Hessian matrix may not be positive definite but its expectation is positive definite.
The hypotheses for testing fixed effects and variance components can be respectively defined as \[ H_{0, i}: \beta_i = 0 ~~\text{versus}~~H_{1,i}: \beta_i\ne 0, \] \[ H_{0, k}: \theta_k \leq 0 ~~\text{versus}~~H_{1,k}: \theta_k > 0, \] where \(\theta_k\), \(k=1, \ldots, K\), represent the parameters of variance components \(\sigma^2_k\) but are allowed to be negative. The lower boundary of the parameters, \(\theta\), can be the negative value such that the variance-covariance matrix, \(V_{\theta}\), remains definable (positive definite). The negative lower boundary exists and can be less than or equal to \(- \sigma^2/\lambda_{max}\), where \(\lambda_{max} > 0\) is the largest singular value of \(ZZ^T\) or \(Z^TZ\). If \(\theta_k > 0\), then \(\sigma_k^2 = \theta_k\) is definable and the mixed model is well-specified. Otherwise, if \(\theta_k \le 0\), the mixed model is miss-specified and the \(k\)-th random-effect component shouldn’t be included.
Allowing the parameters of variance components to take negative value avoids the zero boundary at the null hypothesis for the variance components. Consequently, the asymptotic normality of the maximum likelihood estimates at the null hypothesis holds under regularity conditions, which enables us to use z-statistics or t-statistics for hypothesis testing of the fixed effects and variance components. The t-statistics for fixed effects are given by \[\begin{equation} \label{eq:tcoef} T_i = \frac{\hat\beta_i}{\sqrt{var(\hat\beta_i)}} = \frac{\hat\beta_i}{\sqrt{var(\hat\beta)_{ii}}} ~\sim ~t(n - p). \end{equation}\] The t-statistic for a contrast, a linear combination of the estimated fixed effects, \(c^T\hat\beta\), is \[\begin{equation} \label{eq:tcontrast} T_c = \frac{c^T\hat\beta}{\sqrt{c^Tvar(\hat\beta) c}} \sim t(n-p). \end{equation}\] The z-statistics for the parameters of variance components are given by \[\begin{equation} \label{zvarcomp} Z_{\theta_k} = \frac{\hat\theta_k}{\sqrt{[I(\hat\theta)^{-1}]_{kk}}} \sim N(0, 1), \end{equation}\] where \(I(\theta)\) is the Fisher information matrix.
We can also use likelihood ratio test (LRT) statistics to test the variance components. Let \(\theta_{sub}\) be the sub-vector of \(\theta\) that is not equal to zero. Let \(L(\hat\theta_{sub})\) and \(L(\hat\theta)\) be the maximum likelihood for the sub-model and the full model, respectively. Note that \(L(\hat\theta_{sub})\) and \(L(\hat\theta)\) depend on \(\hat\beta\) for the ML method which is suppressed for simplicity of notation. The LRT statistic, \[\begin{equation}\label{lrtvarcomp} \lambda_{LR} = 2[\log(L(\hat\theta)) - \log(L(\hat\theta_{sub}))] \sim\chi^2_r, \end{equation}\] an asymptotic \(\chi^2\) distribution, where \(r\) is the number of zero variance components. The LRT requires that both sub and full models are fitted by the same method, either ML or REML, and the design matrix of fixed effects must be the same for both sub and full models when using REML method to fit the models.
Building design matrices for fixed and random effects is a key step in LMM-based differential expression analysis. Including library size, a normalization factor for scRNA-seq, as a fixed effect can help reduce p-value inflation. If necessary, we can add principal components as fixed effects to further mitigate unknown technical effects.
In differential expression analysis, modeling samples (subjects) as random effects accounts for between-sample variability and within-sample correlation. If samples have different effects on gene expression, we can also model them as fixed effects. However, this can lead to overfitting when the number of samples (subjects) is large. For scRNA-seq data, this may also result in collinearity when the samples are nested within an experimental condition.
sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Monterey 12.7.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Toronto
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] FLASHMM_1.2.3
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.37     R6_2.6.1          fastmap_1.2.0     Matrix_1.7-3     
#>  [5] xfun_0.52         lattice_0.22-7    cachem_1.1.0      knitr_1.50       
#>  [9] htmltools_0.5.8.1 rmarkdown_2.29    lifecycle_1.0.4   cli_3.6.5        
#> [13] grid_4.4.3        sass_0.4.10       jquerylib_0.1.4   compiler_4.4.3   
#> [17] rstudioapi_0.17.1 tools_4.4.3       evaluate_1.0.4    bslib_0.9.0      
#> [21] yaml_2.3.10       formatR_1.14      rlang_1.1.6       jsonlite_2.0.0   
#> [25] MASS_7.3-65