Population structure also exists in non-human populations. In addition, in some rare cases, there might not be a large enough overlap between the user’s genotypes and the reference data (we recommend > 5% overlap) and consequently the pre-trained classifier. In this case, we recommend training your own random forest classifier. In the following, we will demonstrate the workflow for setting up this training, starting with reference data download. We will use a test study dataset included in the package as an example for the workflow.
Plink v2 is needed for this workflow.
A suitable reference dataset should be downloaded and if necessary,
re-formated into PLINK format. Vignettes Processing HapMap III
reference data for ancestry estimation and Processing
1000Genomes reference data for ancestry estimation show the
download and processing of the HapMap phase III and 1000Genomes phase
III dataset, respectively. In this example, we will use the 1000Genomes
data as the reference dataset. We recommend using
check_relatedness on the reference to filter our related
samples. This prevents data leakage between the training and testing
datasets.
We will first set up some bash variables and create directories needed; storing the names and directories of the reference and study will make it easy to use updated versions of the reference or new datasets in the future. We create a directory named ‘qcdir’ for the study data. In order to keep the data directory tidy, we will create a directory for the log files and move them to the log directory for future reference after each analysis step.
Once you have created ‘qcdir’, be sure to copy the study data (here simply data.bim, data.bed, and data.fam) files into your directory.
qcdir=~/qcdir
refdir=~/reference
name='data'
refname='all_hg38'
mkdir -p $qcdir/plink_log
In order to compute joint principal components of the reference and study population, we will need to combine the two datasets. The plink –merge function enables this merge, but requires the variants in the datasets to be matching by chromosome, position and alleles. The following sections show how to extract the relevant data from the reference and study dataset and how to filter matching variants.
We will use an awk script to find non-A/T and non-G/C SNPs, as these SNPs are more difficult to align, and remove them from both the reference and study data sets. In addition, we will only keep the autosomes (chr 1-22), only keep snps (snps-only), of those, keep only the biallelic ones (max-alleles 2), and remove any duplicates (rm-dup). The data should start off in Plink 1.9 data format.
awk 'BEGIN {OFS="\t"} ($5$6 == "GC" || $5$6 == "CG" \
|| $5$6 == "AT" || $5$6 == "TA") {print $2}' \
$qcdir/$name.bim > \
$qcdir/$name.ac_gt_snps
awk 'BEGIN {OFS="\t"} ($5$6 == "GC" || $5$6 == "CG" \
|| $5$6 == "AT" || $5$6 == "TA") {print $2}' \
$refdir/$refname.bim > \
$qcdir/$refname.ac_gt_snps
plink2 --bfile $refdir/$refname \
--rm-dup exclude-all \
--max-alleles 2 \
--snps-only just-acgt \
--exclude $qcdir/$refname.ac_gt_snps \
--chr 1-22 \
--make-bed \
--out $qcdir/$refname.no_ac_gt_snps
mv $qcdir/$refname.no_ac_gt_snps.log $qcdir/plink_log/$refname.no_ac_gt_snps.log
plink2 --bfile $qcdir/$name \
--rm-dup exclude-all \
--max-alleles 2 \
--snps-only just-acgt \
--exclude $qcdir/$name.ac_gt_snps \
--make-bed \
--chr 1-22 \
--allow-extra-chr \
--out $qcdir/$name.no_ac_gt_snps
mv $qcdir/$name.no_ac_gt_snps.log $qcdir/plink_log/$name.no_ac_gt_snps.log
Variant identifiers can vary between studies. To ensure that the data
can be properly processed, we will rename the variant identifiers to a
common scheme based on chromosome number and basepair location by using
plinkQC’s rename_variant_identifiers function.
library(plinkQC)
name <- "data.no_ac_gt_snps"
refname <- "all_hg38.no_ac_gt_snps"
path2plink2 <- "/Users/syed/bin/plink2"
rename_variant_identifiers(indir=qcdir, qcdir=qcdir, name=name,
path2plink2 = path2plink2)
After filtering for SNPS shared in the study, we will conduct marker and sample quality control on the reference dataset. This is to ensure only robust data is used as a reference. To reduce correlations between SNPs, we will prune for variants in linkage disequilibrium (LD) after removing low quality markers. This is done with plinkQC’s pruning_ld() function. As part of the function, ranges of known high-LD are excluded. These ranges are originally provided by [1].
refname <- "all_hg38.renamed.studysnps"
fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name,
path2plink=path2plink,
verbose=TRUE, interactive=TRUE,
showPlinkOutput=FALSE)
marker_ids <- cleanData(indir = indir, qcdir = qcdir, name = name,
path2plink = path2plink, filterAncestry = FALSE, filterRelated = FALSE, macTh = NULL, mafTh = 0.05, verbose = TRUE,
filterSex = FALSE, filterHeterozygosity = FALSE,
filterSNPMissingness = TRUE, filterSampleMissingness = FALSE,
filterMAF = TRUE, filterHWE = TRUE)
name = paste(name, ".clean", sep = "")
pruning_ld(indir = indir, qcdir = qcdir, name = name,
path2plink = path2plink, genomebuild="hg38")
name = paste(name, ".pruned", sep = "")
fail_samples <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, dont.check_sex = TRUE,
dont.check_relatedness = TRUE,
path2plink=path2plink, dont.check_ancestry = TRUE,
interactive=TRUE, verbose=TRUE)
sample_ids <- cleanData(indir = indir, qcdir = qcdir, name = name,
path2plink = path2plink, filterAncestry = FALSE, filterRelated = FALSE,verbose = TRUE,
filterSex = FALSE, filterHeterozygosity = TRUE,
filterSNPMissingness = FALSE, filterSampleMissingness = TRUE,
filterMAF = FALSE, filterHWE = FALSE)
After conducting quality control, we will compute the joint principal components.
plink2 --bfile $refname.renamed.studysnps.cleaned.pruned.cleaned \
--nonfounders \
--freq counts \
--pca allele-wts 20 \
--out $refname.renamed.studysnps.cleaned.pruned.cleaned.pca
mv $qcdir/$refname.renamed.studysnps.cleaned.pruned.cleaned.pca $qcdir/plink_log
After running a PCA, we will project the original dataset onto the principal components to rescale. This ensures that future projections will be on the same scale.
plink2 --pfile $refname.renamed.studysnps.cleaned.pruned.cleaned
--read-freq $refname.renamed.studysnps.cleaned.pruned.cleaned.pca.acount
--score $refname.renamed.studysnps.cleaned.pruned.cleaned.pca.eigenvec.allele 2 6 header-read no-mean-imputation variance-standardize
--score-col-nums 7-26 --out $refname.renamed.studysnps.cleaned.pruned.cleaned.projection
The principal components of the reference dataset will be used to train the random forest algorithm in R.
library(tidyverse)
library(randomForest)
filepath <- 'insert path to sscore file here'
proj <- read.csv(
file= filepath,
sep='\t', header = TRUE)
package.dir <- find.package('plinkQC')
ancestry_info <-
read_delim(file.path(package.dir,"extdata/Genomes1000_ID2Pop.txt"))
superpop <-
read_delim(file.path(package.dir,"extdata/AncestryGeoLocations.csv"))
We divide up the data into a 70% training portion and 30% testing portion. To ensure a balanced model, we will sample the data to ensure an even representation of each ancestral superpopulation group in the testing portion of the data. The number of inidividuals from each group that is chosen will have to adjusted for the size of the reference dataset. For the 1000 genomes dataset, we use 448 samples in each of the five ancestral groups for a roughly 70% of the data.
proj <- proj %>%
select(-c(ALLELE_CT, NAMED_ALLELE_DOSAGE_SUM))
colnames(proj) <- c("IID", paste0("PC", 1:20))
labeled_proj <- merge(proj, superpop)
n_individuals <- 448
set.seed(123)
idx <- split(seq(nrow(labeled_proj)), labeled_proj$Ancestry)
train.ids <- lapply(idx, function(i) {
sample(i, n_individuals)
})
ids_to_keeps<- unlist(train.ids, use.names = FALSE, recursive = FALSE)
train_proj_1000g <- labeled_proj[ids_to_keeps,]
Then, we can train the random forest. We are providing a starting value for the parameter but we recommend utilizing a grid search (described in detail more below) to determine the optimal values.
train_proj_1000g$Ancestry <- factor(train_proj_1000g$Ancestry)
ancestry_rf <- randomForest(Ancestry ~ .,
data = train_proj_1000g[,-c(2,23)],
method = "rf",
ntree = 750,
importance = TRUE)
ancestry_rf
To predict the ancestries of a new study dataset with our trained random forest, you need to project the newdata onto the principal components of the reference dataset. This is done with the –score function in plink2. The projection is then used as input for the random forest.
plink2 --pfile newdata
--read-freq all_hg38.pca.acount
--score all_hg38.pca.eigenvec.allele 2 6 header-read no-mean-imputation variance-standardize
--score-col-nums 7-26 --out newdata.projection
Then, we can use the random forest to predict the ancestry of the new data.
filepath <- "Insert path to newdata.sscore here"
newdata <- read.csv(
file= filepath,
sep='\t', header = TRUE)
newdata <- newdata %>%
select(-c(ALLELE_CT, NAMED_ALLELE_DOSAGE_SUM))
colnames(newdata) <- c("ID", paste0("PC", 1:20))
predictions <- predict(ancestry_rf, newdata)
To evaluate the accuracy of the model, a commonly used metric is the out-of-bag (OOB) error rate and confusion matrix. The confusion matrix displays the number of correct classifications for each ancestry along the diagonal. Any values outside the diagonal are misclassifications for other ancestries. A summary of these metrics can be seen through:
ancestry_rf
And we can visualize the confusion matrix as a heatmap.
my_rf <- ancestry_rf$confusion
heatmap(my_rf[,-c(27)], scale = 'column', trace = 'none')
There are different optimization techniques to determine the parameters for a random forest. A simple one to implement is a grid-search where models with different parameters are initialized. Based on their performance, the optimal parameters can be determined. The parameters tested in our grid search are the number of principal components included and the number of trees. Within the parameter values of the number of principal components and number of trees, cross-validation is done to determine the mtry variable.
train_results <- data.frame(mtry = c(0), ntrees = c(0), num_pc = c(0), acc = c(0))
train_proj_noids <- train_proj_1000g[,-c(1)]
for (PC_inc in c(1:20)) {
for (ntree in c(1, 5)) {
set.seed(123)
fit <- train(Pop~., data=train_proj_noids[,c(1:PC_inc,21)],
method="rf", metric="Accuracy",
trControl=control, ntree=ntree)
train_proj_1000g <- rbind(train_proj_1000g,
data.frame(mtry = fit$results$mtry,
ntrees = rep(ntree, nrow(fit$results)),
num_pc = rep(PC_inc, nrow(fit$results)),
acc = fit$results$Accuracy))
}
}
Once you have found a successful OOB error rate, you can use the below package to find out important information regarding your tree. The below documentation builds a decision tree using categorical aspects to classify the data. The PC values for each sample are provided in our 1000 Genomes reference dataset file, where each sample has values for 20 distinct PCs. We ask if a certain sample’s value for a specific PC is less than or greater than a given number. Based on that answer, it is either classified as a specific ancestry or needs to undergo more investigation to classify it correctly. The visualization below is only a subset of a decision tree the random forest builds as classifying all 26 ancestries builds quite a complicated and cluttered plot.
# This is a visual of the random forest being made.
# https://github.com/araastat/reprtree
reprtree:::plot.getTree(rf_mtry)
# Two different plots
reprtree:::plot.getTree(rf_mtry, k=1, d=5)
reprtree:::plot.getTree(rf_mtry, k=1, d=6)