normalizeBatch {cydar}    R Documentation

Normalize intensities across batches

Description

Perform normalization to correct intensities across batches with at least one common level.

Usage

normalizeBatch(batch.x, batch.comp, mode="range", p=0.01, 
    target=NULL, markers=NULL, ...)

Arguments

batch.x

A list, where each element is of the same type as x used in prepareCellData (i.e., a ncdfFlowSet or a list of intensity matrices across all samples).

batch.comp

A list of factors (or elements coercible to factors) specifying the composition of each batch, i.e., which samples belong to which groups. This can also be NULL, see Details.

mode

A string or character vector of length equal to the number of markers, specifying whether range-based or warping normalization should be performed for each marker. This can take values of "range", "warp" or "none" (in which case no normalization is performed).

p

A numeric scalar between 0 and 0.5, specifying the percentile (expressed as a proportion) used to define the range of the distribution for range-based normalization. The range spans percentiles p to 1-p.

target

An integer scalar indicating the reference batch. If NULL, the reference is defined by averaging across batches.

markers

A character vector specifying the markers to be normalized and returned. If NULL, all markers are used.

...

Additional arguments to be passed to warpSet for mode="warp".

Details

Consider an experiment containing several batches of barcoded samples, in which the barcoding was performed within but not between batches. This function normalizes the intensities for each marker such that they are comparable between samples in different batches. The process for each marker is as follows:

  1. Weighting is performed to downweight the contribution of larger samples within each batch, as well as to match the composition of samples across different batches. The composition of each batch can be specified by batch.comp, see below for more details. The weighted intensities for each batch represent the pooled distribution of intensities from all samples in that batch.

  2. If mode="range", a quantile function is constructed for the pooled distribution of each batch. These functions are averaged across batches to obtain a reference quantile function, representing a reference distribution. The range of the reference distribution is computed at percentiles p and 1-p (to avoid distortions due to outliers). A batch-specific scaling function is defined to equalize the range of the weighted distribution of intensities from each batch to the reference range.

  3. If mode="warp", weighted sampling from each pooled distribution is performed to generate a pseudo-sample for each batch. These pseudo-samples are used to construct a flowSet for use in warping normalization - see ?normalization and ?warpSet for details. A warping function is computed for each batch that adjusts the intensity distribution to be more similar to the reference (constructed by averaging across batches).

  4. The scaling or warping function is applied to the intensities of all samples in that batch, yielding corrected intensities for direct comparisons between samples.
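
As a rough illustration of the range-based steps above, the following base R sketch equalizes the range of a single marker across three mock batches. It is not the cydar implementation: the pooled distributions are unweighted simulated vectors, the scaling function is assumed to be linear, and the names pooled, make.scaler, scalers and scaled are hypothetical.

```r
# Simplified sketch of range-based scaling for ONE marker (assumptions:
# unweighted pooled intensities, linear scaling; not the cydar internals).
set.seed(100)
pooled <- list(
    Batch1 = rgamma(1000, 2, 2),
    Batch2 = rgamma(1000, 2, 2) * 1.5 + 0.5,
    Batch3 = rgamma(1000, 2, 2) * 0.8 + 0.2
)
p <- 0.01

# Range of each batch at percentiles p and 1-p (2 rows, one column per batch).
batch.ranges <- vapply(pooled, quantile, probs = c(p, 1 - p),
    names = FALSE, FUN.VALUE = numeric(2))

# Evaluating the averaged quantile function at p and 1-p is the same as
# averaging the per-batch quantiles at those percentiles.
ref.range <- rowMeans(batch.ranges)

# Linear function mapping the range of one batch onto the reference range.
make.scaler <- function(cur, ref) {
    grad <- diff(ref) / diff(cur)
    function(x) (x - cur[1]) * grad + ref[1]
}
scalers <- lapply(seq_along(pooled), function(i) {
    make.scaler(batch.ranges[, i], ref.range)
})

# Apply each batch-specific function to the intensities of that batch.
scaled <- Map(function(f, x) f(x), scalers, pooled)
```

After scaling, the p and 1-p percentiles of every element of scaled coincide with ref.range, while values outside that range are extrapolated by the same linear function.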

Groupings can be specified as batch-specific factors in batch.comp, with at least one common group required across all batches. If the composition of each batch is the same, batch.comp can be set to NULL rather than being manually specified. This composition is used to weight the contribution of each sample to the reference distribution. For example, a batch with more samples in group A and fewer samples in group B would get lower weights assigned to the former and larger weights to the latter.
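
The effect of the composition on the weights can be sketched in base R as follows. The weighting scheme shown here is an assumption chosen to match the description above (each common group contributes the same total weight in every batch, and each sample is downweighted by its cell count); the weights used internally by cydar may be computed differently. The cell counts in ncells and all variable names are hypothetical.

```r
# Hypothetical weighting scheme for illustration only.
batch.comp <- list(
    factor(c("A", "A", "A", "B")),  # batch 1: three samples in A, one in B
    factor(c("A", "B", "B", "B"))   # batch 2: one sample in A, three in B
)
ncells <- list(c(1000, 2000, 1000, 500), c(800, 800, 1600, 800))

# Groups that are present in every batch.
common <- Reduce(intersect, lapply(batch.comp, as.character))

# Weight per cell: inversely proportional to the number of cells in the
# sample and to the number of samples of the same group in the batch.
per.cell.weight <- Map(function(comp, n) {
    comp <- as.character(comp)
    in.group <- as.numeric(table(comp)[comp])
    w <- 1 / (in.group * n)
    w[!comp %in% common] <- 0  # non-shared groups do not contribute
    w
}, batch.comp, ncells)

# Total weight contributed by each group in each batch.
group.totals <- Map(function(w, comp, n) tapply(w * n, comp, sum),
    per.cell.weight, batch.comp, ncells)
```

In this sketch, each of the three group A samples in the first batch contributes a total weight of 1/3, whereas the single group B sample contributes 1, so both groups carry equal weight in both batches.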

Construction of the adjustment function relies on the presence of samples from the same group across the different batches. Ideally, all batches would contain samples from all groups, with similar total numbers of cells across batches for each group. The adjustment function will still be applied to intensities for samples from non-shared groups that do not contribute to the reference distribution. However, note that the adjustment may not be accurate if the to-be-corrected intensities lie outside the range of values used to construct the function.

By default, the reference distribution for each marker is defined as an average of the relevant statistic across batches. If target is not NULL, the specified batch will be used as the reference distribution. This means that if mode="range", the reference quantile function will be defined as the quantile function of the chosen batch. Similarly, if mode="warp", warpSet will align all other batches to the locations of the peaks in target.
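
For range-based normalization, the difference between the default and a specified target can be sketched as below. This is a simplified illustration with unweighted mock data for one marker; get.ref.range is a hypothetical helper and not part of cydar.

```r
# Sketch: reference range with and without a target batch (one marker).
set.seed(42)
pooled <- list(rgamma(500, 2, 2), rgamma(500, 2, 2) * 2 + 1)
p <- 0.01

get.ref.range <- function(pooled, p, target = NULL) {
    ranges <- vapply(pooled, quantile, probs = c(p, 1 - p),
        names = FALSE, FUN.VALUE = numeric(2))
    if (is.null(target)) {
        rowMeans(ranges)   # default: average across batches
    } else {
        ranges[, target]   # use the chosen batch as the reference
    }
}

avg.ref <- get.ref.range(pooled, p)
tgt.ref <- get.ref.range(pooled, p, target = 2)
```

Here tgt.ref is simply the range of the second batch, so all other batches would be scaled towards it.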

All markers are used by default when markers=NULL. If markers is specified, only the specified markers will be normalized and returned in the output expression matrices. This is usually more convenient than subsetting the inputs or outputs manually.

To convert the output into a format appropriate for prepareCellData, apply unlist with recursive=FALSE. This will generate a list of intensity matrices for all samples in all batches, rather than a list of lists of matrices. Note that a batch effect should still be included in the design matrix when modelling abundances, as only the intensities are corrected here.
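
A small base R sketch of this flattening step is shown below. The nested list is mock data with the same shape as the output described in Value, and the names flat and batch.of.sample are hypothetical.

```r
# Mock output: a list of batches, each a list of per-sample matrices.
corrected <- list(
    Batch1 = list(Y1 = matrix(1, 5, 2), Y2 = matrix(2, 5, 2)),
    Batch2 = list(Y1 = matrix(3, 5, 2))
)

# Flatten one level only, so each element remains an intensity matrix.
flat <- unlist(corrected, recursive = FALSE)

# Record the batch of origin for each sample, e.g., for use as a
# blocking factor in the design matrix.
batch.of.sample <- rep(names(corrected), lengths(corrected))
```

The flattened list has one element per sample, with names formed from the batch and sample names (e.g., "Batch1.Y1").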

Value

A list of lists, where each internal list corresponds to a batch and contains intensity matrices corresponding to all samples in that batch. This matches the format of batch.x.

Choosing between normalization methods

Warping normalization can be more powerful than range-based normalization, as the former can eliminate non-linear changes to the intensities whereas the latter cannot. However, it requires that landmarks in the intensity distribution (i.e., peaks) be easily identifiable and consistent across batches. Large differences (e.g., a peak present in one batch and absent in another) may lead to incorrect adjustments.

Such differences may be present when batches are confounded with uninteresting biological factors (e.g., individual, mouse of origin) that affect cell abundance. In such cases, range-based normalization with mode="range" is recommended as it is more constrained in how the intensities are adjusted. This reduces the risk of distorting the intensities, albeit at the cost of “under-normalizing” the data.

It is advisable to inspect the intensity distributions before and after normalization, to ensure that the methods have behaved appropriately. This can be done by constructing histograms for each marker with multiIntHist. See also diffIntDistr for a quantitative measure of similarity between distributions.

Author(s)

Aaron Lun

See Also

prepareCellData, diffIntDistr, multiIntHist, normalization, warpSet

Examples

### Mocking up some data: ###
nmarkers <- 10
marker.names <- paste0("X", seq_len(nmarkers))
all.x <- list()

for (b in paste0("Batch", 1:3)) { # 3 batches
    nsamples <- 10
    sample.names <- paste0("Y", seq_len(nsamples))
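    trans.shift <- runif(nmarkers, 0, 1) # batch-specific shift for each marker
    trans.grad <- runif(nmarkers, 1, 2)  # batch-specific scaling for each marker
    x <- list()
    for (i in sample.names) {
        # 1000 cells per sample; rows are markers until transposed.
        ex <- matrix(rgamma(nmarkers*1000, 2, 2), nrow=nmarkers)
        ex <- t(ex*trans.grad + trans.shift)
        colnames(ex) <- marker.names
        x[[i]] <- ex
    }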
    all.x[[b]] <- x
}

batch.comp <- list( # Each batch contains a different composition/ordering of groups
    factor(rep(1:2, c(3,7))),
    factor(rep(1:2, c(7,3))),
    factor(rep(1:2, 5))
)

### Running the function: ###
corrected <- normalizeBatch(all.x, batch.comp, mode="range")
# Comparing the first marker in the third sample of batches 1 and 2.
par(mfrow=c(1,2))
plot(ecdf(all.x[[1]][[3]][,1]), col="blue", main="Before")
plot(ecdf(all.x[[2]][[3]][,1]), add=TRUE, col="red")
plot(ecdf(corrected[[1]][[3]][,1]), col="blue", main="After")
plot(ecdf(corrected[[2]][[3]][,1]), add=TRUE, col="red")

# Similar effects with warping normalization. 
if (.Platform$OS.type!="windows") {
    wcorrected <- normalizeBatch(all.x, batch.comp, mode="warp")
}

[Package cydar version 1.4.0 Index]