pcair {GENESIS} | R Documentation |
pcair is used to perform a Principal Components Analysis using genome-wide SNP data for the detection of population structure in a sample. Unlike a standard PCA, PC-AiR accounts for sample relatedness (known or cryptic) to provide accurate ancestry inference that is not confounded by family structure.
pcair(genoData, v = 20, kinMat = NULL, kin.thresh = 2^(-11/2),
      divMat = NULL, div.thresh = -2^(-11/2), unrel.set = NULL,
      scan.include = NULL, snp.include = NULL, chromosome = NULL,
      snp.block.size = 10000, MAF = 0.01, verbose = TRUE)

## S3 method for class 'pcair'
print(x, ...)

## S3 method for class 'pcair'
summary(object, ...)

## S3 method for class 'summary.pcair'
print(x, ...)
genoData |
An object of class |
v |
The number of principal components to be returned; the default is 20. If |
kinMat |
An optional symmetric matrix of pairwise kinship coefficients for every pair of individuals in the sample (the values on the diagonal do not matter, but the upper and lower triangles must both be filled) used for partitioning the sample into the 'unrelated' and 'related' subsets. See 'Details' for how this interacts with |
kin.thresh |
Threshold value on |
divMat |
An optional symmetric matrix of pairwise divergence measures for every pair of individuals in the sample (the values on the diagonal do not matter, but the upper and lower triangles must both be filled) used for partitioning the sample into the 'unrelated' and 'related' subsets. See 'Details' for how this interacts with |
div.thresh |
Threshold value on |
unrel.set |
An optional vector of IDs for identifying individuals that are forced into the unrelated subset. See 'Details' for how this interacts with |
scan.include |
A vector of IDs for samples to include in the analysis. If NULL, all samples are included. |
snp.include |
A vector of SNP IDs to include in the analysis. If NULL, see |
chromosome |
A vector of integers specifying which chromosomes to analyze. This parameter is only considered when |
snp.block.size |
The number of SNPs to read-in/analyze at once. The default value is 10000. |
MAF |
Minor allele frequency filter; any SNPs with MAF less than this value will be excluded from the analysis; the default value is 0.01. |
verbose |
Logical indicator of whether updates from the function should be printed to the console; the default is TRUE. |
object |
An object of class ' |
x |
An object of class ' |
... |
Further arguments passed to or from other methods. |
The basic premise of PC-AiR is to partition the entire sample of individuals into an ancestry representative 'unrelated subset' and a 'related subset', perform standard PCA on the 'unrelated subset', and predict PC values for the 'related subset'.
We recommend using software that accounts for population structure to estimate pairwise kinship coefficients to be used in kinMat. Any pair of individuals with a pairwise kinship greater than kin.thresh will be declared 'related'. Kinship coefficient estimates from the KING-robust software are used as measures of ancestry divergence in divMat. Any pair of individuals with a pairwise divergence measure less than div.thresh will be declared ancestrally 'divergent'. Typically, kin.thresh and div.thresh are set to be the amount of error around 0 expected in the estimate for a pair of truly unrelated individuals.
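As a rough illustration of how the two thresholds act on pairwise estimates, the base R sketch below applies the default values from the usage line to a small made-up matrix. The matrix `toyMat`, its IDs, and its values are hypothetical, and the flagging logic is only a paraphrase of the rule described above, not the code GENESIS runs internally.

```r
# Default thresholds from the pcair() usage line
kin.thresh <- 2^(-11/2)    # about 0.0221
div.thresh <- -2^(-11/2)   # about -0.0221

# Hypothetical toy matrix of pairwise estimates for four individuals
ids <- c("A", "B", "C", "D")
toyMat <- matrix(0, 4, 4, dimnames = list(ids, ids))
toyMat["A", "B"] <- toyMat["B", "A"] <- 0.25    # e.g. a first-degree pair
toyMat["C", "D"] <- toyMat["D", "C"] <- -0.05   # clearly negative estimate
toyMat["A", "C"] <- toyMat["C", "A"] <- 0.01    # within the error band around 0

# Flag pairs, ignoring the diagonal (its values do not matter)
offdiag   <- row(toyMat) != col(toyMat)
related   <- offdiag & toyMat > kin.thresh
divergent <- offdiag & toyMat < div.thresh

# List each flagged pair once (upper triangle only)
which(related & upper.tri(related), arr.ind = TRUE)
which(divergent & upper.tri(divergent), arr.ind = TRUE)
```

Here the pair A-B is flagged as related, C-D as divergent, and A-C as neither, because 0.01 lies between the two thresholds.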
If divMat = NULL and kinMat is specified, the kinship coefficient estimates in kinMat will also be used as divergence measures in place of divMat.
It is important that the order of individuals in the matrices kinMat and divMat match the order of individuals in the genoData.
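One way to guard against a mismatch in ordering is to reorder the matrix by sample ID before the call. The sketch below uses base R only and assumes the matrix carries sample IDs as row and column names, which may not hold for every input. The IDs and values are hypothetical. With a GWASTools GenotypeData object, the ID vector would typically come from getScanID(genoData).

```r
# Hypothetical sample IDs in the order stored in genoData; with a GWASTools
# GenotypeData object these would typically come from getScanID(genoData)
scanID <- c("NA001", "NA002", "NA003")

# Hypothetical kinship matrix whose rows/columns are in a different order
kinMat <- matrix(c(0.50, 0.00, 0.25,
                   0.00, 0.50, 0.01,
                   0.25, 0.01, 0.50), nrow = 3, byrow = TRUE,
                 dimnames = list(c("NA002", "NA003", "NA001"),
                                 c("NA002", "NA003", "NA001")))

# Reorder rows and columns by ID so that they line up with genoData
idx <- match(scanID, rownames(kinMat))
stopifnot(!anyNA(idx))            # every sample must be present in the matrix
kinMat <- kinMat[idx, idx]

# Final sanity checks before calling pcair()
stopifnot(identical(rownames(kinMat), scanID),
          identical(colnames(kinMat), scanID),
          isSymmetric(kinMat))
```

The same reordering would apply to divMat when it is supplied separately.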
There are multiple ways to partition the sample into an ancestry representative 'unrelated subset' and a 'related subset'. If kinMat is specified and unrel.set = NULL, then the PC-AiR algorithm is used to find an 'optimal' partition (see 'References' for a paper describing the algorithm). If kinMat = NULL and unrel.set is specified, then the individuals with IDs in unrel.set are used as the 'unrelated subset'. If both kinMat and unrel.set are specified, then all individuals with IDs in unrel.set are forced into the 'unrelated subset' and the PC-AiR algorithm is used to partition the rest of the sample; this is especially useful for including reference samples of known ancestry in the 'unrelated subset'. If kinMat = NULL and unrel.set = NULL, then a standard principal components analysis that does not account for relatedness is performed.
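The four cases above can be summarised as a small decision function. This is only a sketch in base R: `partition_mode` is a hypothetical helper that is not part of GENESIS, and it returns a description of the case rather than performing any partitioning.

```r
# Hypothetical helper (not part of GENESIS) that mirrors the four cases above
partition_mode <- function(kinMat = NULL, unrel.set = NULL) {
  has_kin <- !is.null(kinMat)
  has_set <- !is.null(unrel.set)
  if (has_kin && !has_set) {
    "PC-AiR algorithm partitions the whole sample"
  } else if (!has_kin && has_set) {
    "unrel.set is used as the 'unrelated subset'"
  } else if (has_kin && has_set) {
    "unrel.set is forced in; PC-AiR algorithm partitions the rest"
  } else {
    "standard PCA, relatedness not accounted for"
  }
}

toyKin <- diag(0.5, 3)  # placeholder matrix; only its presence matters here
partition_mode(kinMat = toyKin)
partition_mode(unrel.set = c("A", "B"))
partition_mode(kinMat = toyKin, unrel.set = "A")
partition_mode()
```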
An object of class 'pcair'. A list including:
vectors |
A matrix of the top |
values |
A vector of eigenvalues matching the top |
sum.values |
The sum of all the eigenvalues from the standard PCA run on the 'unrelated subset' (regardless of how many were returned). |
rels |
A vector of IDs for individuals in the 'related subset'. |
unrels |
A vector of IDs for individuals in the 'unrelated subset'. |
kin.thresh |
The threshold value used for declaring each pair of individuals as related or unrelated. |
div.thresh |
The threshold value used for determining if each pair of individuals is ancestrally divergent. |
nsamp |
The total number of samples in the analysis. |
nsnps |
The total number of SNPs used in the analysis, after filtering on |
MAF |
The minor allele frequency (MAF) filter used on SNPs. |
call |
The function call passed to |
method |
A character string. Either "PC-AiR" or "Standard PCA" identifying which method was used for computing principal components. |
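One common use of values together with sum.values is to express each returned eigenvalue as a share of the total from the PCA on the 'unrelated subset'. The sketch below uses a hypothetical stand-in list named `toy` with made-up numbers; with a real result, the same arithmetic would use the corresponding elements of the returned object.

```r
# Hypothetical stand-in for a fitted object; real values come from pcair()
toy <- list(values = c(12, 6, 2), sum.values = 40)

# Share of the eigenvalue total (from the 'unrelated subset' PCA) captured by
# each returned component, and the cumulative share
prop <- toy$values / toy$sum.values
round(prop, 3)          # 0.30 0.15 0.05
round(cumsum(prop), 3)  # 0.30 0.45 0.50
```

Because sum.values covers all eigenvalues regardless of how many were returned, the cumulative share need not reach 1.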
The GenotypeData function in the GWASTools package should be used to create the input genoData. Input to the GenotypeData function can easily be created from an R matrix or GDS file. PLINK .bed, .bim, and .fam files can easily be converted to a GDS file with the function snpgdsBED2GDS in the SNPRelate package. Alternatively, the SeqVarData function in the SeqVarTools package can be used to create the input genoData when working with sequencing data.
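For the R matrix route, a minimal sketch might look like the following. It assumes the GWASTools package is installed and uses its MatrixGenotypeReader constructor, with SNPs in rows and samples in columns. The genotype values, IDs, chromosome codes, and positions are all made up for illustration.

```r
library(GWASTools)

# Hypothetical toy data: 5 SNPs (rows) by 4 samples (columns), coded as 0/1/2
set.seed(1)
geno <- matrix(sample(0:2, 20, replace = TRUE), nrow = 5, ncol = 4)

# Wrap the matrix in a reader, then in a GenotypeData object
mgr <- MatrixGenotypeReader(genotype   = geno,
                            snpID      = 1:5,
                            chromosome = rep(1L, 5),
                            position   = c(100L, 200L, 300L, 400L, 500L),
                            scanID     = 1:4)
toy_genoData <- GenotypeData(mgr)

nsnp(toy_genoData)   # number of SNPs
nscan(toy_genoData)  # number of samples
```

A data set this small is only meant to show the object construction; it is far too small for a meaningful PCA.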
Matthew P. Conomos
Conomos, M.P., Miller, M., & Thornton, T. (2015). Robust Inference of Population Structure for Ancestry Prediction and Correction of Stratification in the Presence of Relatedness. Genetic Epidemiology, 39(4), 276-293.
Manichaikul, A., Mychaleckyj, J.C., Rich, S.S., Daly, K., Sale, M., & Chen, W.M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.
pcairPartition for a description of the function used by pcair that can be used to partition the sample into 'unrelated' and 'related' subsets without performing PCA.
plot.pcair for plotting.
king2mat for creating a matrix of pairwise kinship coefficient estimates from KING output text files that can be used for kinMat or divMat.
GWASTools for a description of the package containing the following functions: GenotypeData for a description of creating a GenotypeData class object for storing sample and SNP genotype data, MatrixGenotypeReader for a description of reading in genotype data stored as a matrix, and GdsGenotypeReader for a description of reading in genotype data stored as a GDS file. Also see snpgdsBED2GDS in the SNPRelate package for a description of converting binary PLINK files to GDS. The generic functions summary and print.
library(GENESIS)   # provides pcair() and the example data
library(GWASTools)

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")

# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)

# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)

# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")

# run PC-AiR
mypcair <- pcair(genoData = HapMap_genoData,
                 kinMat = HapMap_ASW_MXL_KINGmat,
                 divMat = HapMap_ASW_MXL_KINGmat)

# close the connection to the GDS file
close(HapMap_genoData)