Verbal autopsy (VA) is a well-established approach to ascertaining the cause of death (COD) when medical certification and full autopsies are not feasible or practical (Garenne 2014; Taylor et al. 1983). After a death is identified, a specially-trained fieldworker interviews the caregivers (usually family members) of the decedent. A typical VA interview includes a set of structured questions with categorical or quantitative responses and a narrative section that records the “story” of the death from the respondent’s point of view (World Health Organization 2012).
This vignette aims to provide a quick introduction to the openVA package using simple case studies. For more detailed discussion of the package and related VA methods, please refer to (Zehang R. Li et al. 2021)
The openVA package (Zehang Richard Li et al. 2022) addresses this issue through an open-source toolkit. The openVA suite comprises a collection of R packages for the analysis of verbal autopsy data. The goal of this package is to provide researchers with an open-source tool that supports flexible data input formats and allows easier data manipulation and model fitting. The openVA suite consists of four core packages that are on CRAN, InterVA4 (Zehang R. Li et al. 2019), InterVA5 (Thomas et al. 2021), InSilicoVA (Zehang R. Li, McCormick, and Clark 2022), and Tariff (Zehang R. Li, McCormick, and Clark 2018), and an optional package nbc4va (Wen et al. 2018). Each of these packages implements one coding algorithm. For three of these algorithms – namely, InterVA-4, InterVA-5 and Tariff – there are also compiled software programs distributed by the original authors.
The openVA package is hosted on CRAN and can be installed with the following commands. For the analysis in this paper, we also install the nbc4va package separately from GitHub. The versions of the supporting packages can be checked in R using the openVA_status() function.
set.seed(12345)
library(openVA)
remotes::install_github("rrwen/nbc4va")
library(nbc4va)
openVA_status()The common modeling framework for VA data consists of first converting the survey responses into a series of binary variables, and then assigning a COD to each death based on the binary input variables. Typically, the target of inference consists of two parts: the individual cause-of-death assignment, and the population-level cause-specific mortality fractions (CSMF), i.e., the fraction of deaths due to each cause. In this section, we formally compare the modeling approaches utilized by each algorithm for these two targets. We adopt the following notations. Consider \(N\) deaths, each with \(S\) binary indicators of symptoms. Let \(s_{ij}\) denote the indicator for the presence of \(j\)-th symptom in the \(i\)-th death, which can take values 0, 1, or NA (for missing data). We consider a pre-defined set of causes of size \(C\). For the \(i\)-th death, denote the COD by \(y_i \in \{1, ..., C\}\) and the probability of dying from cause \(k\) is denoted by \(P_{ik}\). For the population, the CSMF of cause \(k\) is denoted as \(\pi_k\), with \(\sum_{k=1}^C \pi_k = 1\).
InterVA4 (Byass et al. 2012) and InterVA5 (Byass et al. 2019) algorithms calculate the probability of each COD given the observed symptoms using the following formula, \[ P_{ik} = \frac{\pi_{k}^{(0)} \prod_{j=1}^S P(s_{ij}=1|y_{i}=k) \mathbf{1}_{s_{ij} = 1}} {\sum_{k' = 1}^C \pi_{k'}^{(0)} \prod_{j=1}^S P(s_{ij}=1|y_{i}=k') \mathbf{1}_{s_{ij} = 1}} \] where both the prior distribution of each of the causes, \(\pi_{k}^{(0)}\) and the conditional probabilities \(P(s_{ij} = 1 | y_i = k)\) are fixed values provided in the InterVA software. It is worth noting that the formula does not follow the standard Bayes’ rule as it omits the probability that any symptom is absent. A detailed discussion of this modeling choice can be found in McCormick et al. (2016). The conditional probabilities, \(P(s_{ij}=1|y_{i}=k)\), used in InterVA algorithms are represented as rankings with letter grades instead of numerical values (Byass et al. 2012). For example, \(P(s_{ij}=1|y_{i}=k) = A+\) is translated into \(P(s_{ij}=1|y_{i}=k) = 0.8\), etc. The standard InterVA software only supports the fixed set of symptoms and causes where such prior information is provided. For a different data input format, this formulation can be easily generalized if training data are available. We include in the openVA package an extension of the algorithm that calculates \(\hat P(s_{ij}=1|y_{i}=k)\) from the empirical distribution in the training data and then maps to letter grades with different truncation rules. Details of the extended InterVA algorithm can be found in McCormick et al. (2016).
After the individual COD distributions are calculated, InterVA-4 utilizes a series of pre-defined rules to identify up to the top three most likely COD assignments, and truncates the probabilities for the rest of the CODs to 0 and adds an ‘undetermined’ category so that the probabilities sum up to 1 (See the user guide of Byass (2015)). Then the population-level CSMFs are calculated as the aggregation of individual COD distributions, such that \[ \pi_k = \sum_{i=1}^N P^*_{ik} \] where \(P^*_{ik}\) denotes the individual COD distribution after introducing the undetermined category.
Naive Bayes Classifier (Miasnikof et al. 2015) is very similar to the InterVA algorithm with two major differences. First, instead of considering only symptoms that present, the NBC algorithm also considers symptoms that are absent. Second, the conditional probabilities of symptoms given causes are calculated from training data instead of given by physicians, which is similar to our extension of InterVA discussed above. Similar to InterVA, the NBC method can be written as \[ P_{ik} = \frac{\pi_{k}^{(0)} \prod_{j=1}^S (P(s_{ij}=1|y_{i}=k) \mathbf{1}_{s_{ij} = 1} + P(s_{ij} \neq 1|y_{i}=k) \mathbf{1}_{s_{ij} \neq 1})} {\sum_{k' = 1}^C \pi_{k'}^{(0)} \prod_{j=1}^S (P(s_{ij}=1|y_{i}=k') \mathbf{1}_{s_{ij} = 1}+ P(s_{ij} \neq 1|y_{i}=k') \mathbf{1}_{s_{ij} \neq 1})} \] and the CSMFs are calculated by \(\pi_k = \sum_{i=1}^N P_{ik}\).
InSilicoVA algorithm (McCormick et al. 2016) assumes a generative model that characterizes both the CSMF at the population level, and the COD distributions at the individual level. In short, the core generative process assumes \[\begin{eqnarray} s_{ij} | y_i = k &\propto& \mbox{Bernoulli}(P(s_{ij} | y_i = k)) \\ y_i | \pi_1, ..., \pi_C &\propto& \mbox{Categorical}(\pi_1, ..., \pi_C) \\ \pi_k &=& \exp \theta_k / \sum_{k=1}^C \exp \theta_k \\ \theta_k &\propto& \mbox{Normal}(\mu, \sigma^2) \end{eqnarray}\] and hyperpriors are also placed on \(P(s_{ij} | y_i = k)\), \(\mu\), and \(\sigma^2\). The priors for \(P(s_{ij} | y_i = k)\) are set by the rankings used in InterVA-4 if the data are prepared into InterVA format, otherwise they are learned from training data. The priors on \(\mu\) and \(\sigma^2\) are diffuse uniform priors. Parameter estimation is performed using MCMC, so that a sample of posterior distributions of \(\pi_k\) can be obtained after the sampler converges.
Tariff algorithm (James et al. 2011) differs from the other three methods in that it does not calculate an explicit probability distribution of the COD for each death. Instead, for each death \(i\), a Tariff score is calculated for each COD \(k\) so that \[ Score_{ik} = \sum_{j = 1}^{S} \mbox{Tariff}_{kj}\mathbf{1}_{s_{ij}=1} \] where the symptom-specific Tariff score \(\mbox{Tariff}_{kj}\) is defined as \[ \mbox{Tariff}_{kj} = \frac{n_{kj} - median(n_{1j}, n_{2j}, ..., n_{Cj})} {IQR(n_{1j}, n_{2j}, ..., n_{Cj})} \] where \(n_{kj}\) is the count of how many deaths from cause \(k\) contain symptom \(j\) in the training data. The Tariff scores are then turned into rankings by comparing them to a reference distribution of scores calculated from re-sampling the training dataset to obtain a uniform COD distribution. It is worth noting that the Tariff algorithm produces the COD distribution for each death in terms of their rankings instead of the probability distributions. And thus the CSMF for each cause \(k\) is calculated by the fraction of deaths with cause \(k\) being the highest ranked cause, i.e., \[ \pi_k = \frac{\sum_{i=1}^N\mathbf{1}_{y_i = k}}{N} \]
In addition to the different model specifications underlying each algorithm, there is also a major conceptual difference in handling missing symptoms across the algorithms. Missing symptoms could arise from different stages of the data collection process. For example, the respondent may not know whether certain symptoms existed or may refuse to answer a question. From a statistical point of view, knowing that a symptom does not exist provides some information to the possible cause assignment, while a missing symptom does not. Although in theory, most of the VA algorithms could benefit from distinguishing ‘missing’ from ‘absent’, InSilicoVA is the only algorithm that has been implemented to acknowledge missing data. Missing indicators are assumed to be equivalent to ‘absence’ in InterVA, NBC, and Tariff.
In the openVA package, we consider two main types of standardized questionnaire: the WHO instrument and the IHME questionnaire. In this section, we focus on describing these two data formats and tools to clean and convert data. Pre-processing the raw data collected from the survey instrument (usually with Open Data Toolkit) is usually performed with additional packages and software outside of the analysis pipeline in R. We briefly mention software for data pre-processing towards the end of this paper.
For users familiar with InterVA software and the associated data processing steps, the standard input format from the WHO 2012 and 2016 instruments is usually well understood. For the 2012 instrument, the data expected by the InterVA-4 software are organized into a data frame where each row represents one death and the corresponding VA information is contained in \(246\) fields, starting from the first item being the ID of the death. The \(245\) items following the ID each represent one binary variable of symptom/indicator, where ‘presence’ is coded by ‘Y’, and ‘absence’ is coded by an empty cell.
To accommodate updates for the WHO 2016 instrument (D’Ambruoso et al. 2017), the InterVA-5 software accepts a data frame with \(354\) columns that include \(353\) columns of symptom/indicators followed by an additional column for the record ID. It should be noted that the R package InterVA5 retains the format with the record ID residing in the first column. Another important update with InterVA-5 is that it acknowledges the difference between “Yes” and “No” (or “Y/y” and “N/n”, which is different from the coding scheme in InterVA-4), both of which are processed as relevant responses, while all other responses are treated as missing values and ignored. With respect to the list of causes of death, InterVA-5 utilizes the WHO 2016 COD categories, which is nearly identical to the WHO 2012 COD categories (used by InterVA-4) except that hemorrhagic fever and dengue fever are two separate categories in the 2016 COD categories.
The same input format is inherited by the openVA package, except for one modification. We further distinguish ‘missing’ and ‘absent’ in the input data frame explicitly. We highly recommend that users pre-process all the raw data so that a ‘missing’ value in the data spreadsheet is coded as a ‘.’ (following the Stata practice familiar to many VA practitioners), and an ‘absent’ value is indicated by an empty cell, as in the standard InterVA-4 software. For WHO 2016 data, both ‘.’ and ‘-’ (following the default coding scheme of InterVA-5 software) are interpreted as missing values. For methods other than InSilicoVA, ‘missing’ and ‘absent’ will be considered the same internally and thus will not introduce a compatibility problem.
The Population Health Metrics Research Consortium (PHMRC) gold standard VA data (Murray et al. 2011) consist of three datasets corresponding to adult, child, and neonatal deaths, respectively. All deaths occurred in health facilities and gold-standard causes are determined based on laboratory, pathology and medical imaging findings. These datasets can be downloaded directly using the link returned by the function getPHMRC_url(). For example, we can read the adult VA dataset using the following command.
PHMRC_adult <- read.csv(getPHMRC_url("adult"))Although the data are publicly accessible, a major practical challenge for studies involving the PHMRC gold standard dataset is that the pre-processing steps described from the relevant publications are not clear enough nor easy to implement. The openVA package internally cleans up the PHMRC gold standard data when calling the codeVA() function on the PHMRC data. The procedure follows the steps described in the supplement material of McCormick et al. (2016). Notice that the original PHMRC data are useful for comparing and validating new methods, as well as for using as training data, but the cleaning functions only require that the columns are exactly the same as the PHMRC gold standard datasets, so they could also be used for new data that are pre-processed into the same format.
In addition to the two standard questionnaires discussed previously, researchers might also be interested in including customized dichotomous symptoms in their analysis. The openVA package also supports customized inputs as long as they are dichotomous. In such case, neither the built-in conditional probability matrix of InterVA nor the PHMRC gold standard dataset could be used to learn the relationship between training and testing data, thus different training data with known causes of death are necessary for all three algorithms. The ConvertData() function can be used to convert data with customized coding schemes into the format recognized by the openVA package.
In the first example, we demonstrate fitting InterVA and InSilicoVA on a random sample of \(1,000\) deaths from the ALPHA network without cause-of-death labels collected with the WHO 2012 instrument. The dataset is already included in the openVA package as a dataset RandomVA1 and can be loaded directly.
data(RandomVA1)
dim(RandomVA1)## [1] 1000  246head(RandomVA1[, 1:10])##   ID elder midage adult child under5 infant neonate male female
## 1 d1     Y                                             Y       
## 2 d2     Y                                                    Y
## 3 d3            Y                                      Y       
## 4 d4                  Y                                       Y
## 5 d5                  Y                                Y       
## 6 d6                  Y                                       YThe codeVA() function provides a standardized syntax to fit different VA models. Internally, the codeVA() function organizes the input data according to the specified data type, checks for incompatibility of the data and specified model, and calls the corresponding model fitting functions. It returns a classed object of the specified model class. In this example, we use version 4.03 of the InterVA software, which is the latest release of the original software compatible with the WHO 2012 instrument. Any additional model-specific parameters can be passed through the arguments of codeVA(). Here we specify the HIV and malaria prevalence levels required by the InterVA model to be ‘high’. Guidelines on how to set these parameters can be found in Byass et al. (2012).
fit_inter_who <- codeVA(data = RandomVA1, data.type = "WHO2012", 
                        model = "InterVA", version = "4.03", 
                        HIV = "h", Malaria = "h")summary(fit_inter_who) ## InterVA-4 fitted on 1000 deaths
## CSMF calculated using reported causes by InterVA-4 only
## The remaining probabilities are assigned to 'Undetermined'
## 
## Top 5 CSMFs:
##  cause                     likelihood
##  Undetermined              0.154     
##  HIV/AIDS related death    0.122     
##  Stroke                    0.072     
##  Reproductive neoplasms MF 0.058     
##  Pulmonary tuberculosis    0.055We can implement InSilicoVA method with similar syntax. We use the default parameters and run the MCMC for \(10,000\) iterations. Setting the auto.length argument to FALSE specifies that the algorithm does not automatically increase the length of the chain when convergence failed. The InSilicoVA algorithm is implemented using a Metropolis-Hastings within Gibbs sampler. The acceptance rate is printed as part of the message as the model samples from the posterior distribution.
fit_ins_who <- codeVA(RandomVA1, data.type = "WHO2012", model = "InSilicoVA",
                    Nsim = 10000, auto.length = FALSE)summary(fit_ins_who) ## InSilicoVA Call: 
## 1000 death processed
## 10000 iterations performed, with first 5000 iterations discarded
##  250 iterations saved after thinning
## Fitted with re-estimated conditional probability level table
## Data consistency check performed as in InterVA4
## 
## Top 10 CSMFs:
##                                    Mean Std.Error Lower Median Upper
## Other and unspecified infect dis  0.266    0.0168 0.235  0.265 0.301
## HIV/AIDS related death            0.102    0.0091 0.085  0.102 0.119
## Renal failure                     0.101    0.0108 0.084  0.101 0.123
## Other and unspecified neoplasms   0.062    0.0089 0.046  0.061 0.080
## Other and unspecified cardiac dis 0.058    0.0076 0.044  0.058 0.075
## Digestive neoplasms               0.050    0.0077 0.033  0.050 0.065
## Acute resp infect incl pneumonia  0.048    0.0073 0.034  0.049 0.063
## Pulmonary tuberculosis            0.039    0.0068 0.025  0.039 0.054
## Stroke                            0.038    0.0061 0.027  0.038 0.052
## Other and unspecified NCD         0.034    0.0089 0.018  0.034 0.052In the second example, we consider a prediction task using the PHMRC adult dataset. We first load the complete PHMRC adult dataset from its on-line repository, and organize it into training and test datasets. We treat all deaths from Andhra Pradesh, India as the test dataset.
PHMRC_adult <- read.csv(getPHMRC_url("adult"))
is.test <- which(PHMRC_adult$site == "AP")
test <- PHMRC_adult[is.test, ]
train <- PHMRC_adult[-is.test, ]
dim(test)## [1] 1554  946dim(train)## [1] 6287  946In order to fit the models on the PHMRC data, we specify data.type = "PHMRC" and phmrc.type = "adult" to indicate the data input is collected using the PHMRC adult questionnaire. We also specify the column of the causes-of-death label in the training data. The rest of the syntax is similar to the previous example.
When the input consists of both training and testing data, the InterVA and InSilicoVA algorithms estimate the conditional probabilities of symptoms using the training data, instead of using the built-in values. In such case, the version argument for the InterVA algorithm is suppressed. There are several ways to map the conditional probabilities of symptoms given causes in the training dataset to a letter grade system, specified by the convert.type argument. The convert.type = "quantile" performs the mapping so that the percentile of each rank stays the same as the original \(P_{s|c}\) matrix in InterVA software. Alternatively we can also use the original fixed values of translation, and assign letter grades closest to each entry in \(\hat{P}_{s|c}\). This conversion is specified by convert.type = "fixed", and is more closely aligned to the original InterVA and InSilicoVA setting. Finally, we can also directly use the values in the \(\hat{P}_{s|c}\) without converting them to ranks and re-estimating the values associated with each rank. This can be specified by convert.type = "empirical". In this demonstration, we assume the fixed value conversion.
fit_inter <- codeVA(data = test, data.type = "PHMRC", model = "InterVA", 
                     data.train = train, causes.train = "gs_text34", 
                     phmrc.type = "adult", convert.type = "fixed")fit_ins <- codeVA(data = test, data.type = "PHMRC", model = "InSilicoVA",
                    data.train = train, causes.train = "gs_text34", 
                    phmrc.type = "adult", convert.type = "fixed", 
                    Nsim=10000, auto.length = FALSE)The NBC and Tariff method can be fit using similar syntax.
fit_nbc <- codeVA(data = test, data.type = "PHMRC", model = "NBC", 
                   data.train = train, causes.train = "gs_text34", 
                   phmrc.type = "adult")fit_tariff <- codeVA(data = test, data.type = "PHMRC", model = "Tariff",
                     data.train = train, causes.train = "gs_text34", 
                     phmrc.type = "adult")Notice that we do not need to transform the PHMRC data manually. Data transformations are performed automatically within the codeVA() function.
In this section we demonstrate how to summarize results, extract output, and visualize and compare fitted results. All the fitted object returned by codeVA() are S3 objects, for which a readable summary of model results can be obtained with the summary() function as shown in the previous section. In addition, several other metrics are commonly used to evaluate and compare VA algorithms at either the population or individual levels. In the rest of this section, we show how to easily calculate and visualize some of these metrics with the openVA package.
We can extract the CSMFs directly using the getCSMF() function. The function returns a vector of the point estimates of the CSMFs, or a matrix of posterior summaries of the CSMF for the InSilicoVA algorithm.
csmf_inter <- getCSMF(fit_inter)
csmf_ins <- getCSMF(fit_ins)
csmf_nbc <- getCSMF(fit_nbc)
csmf_tariff <- getCSMF(fit_tariff)One commonly used metric to evaluate the CSMF estimates is the so-called CSMF accuracy, defined as \[ CSMF_{acc} = 1 - \frac{\sum_j^C CSMF_i - CSMF_j^{(true)}}{2(1 - \min CSMF^{(true)})} \] The CSMF accuracy can be readily computed using functions in openVA as the codes below shows.
csmf_true <- table(c(test$gs_text34, unique(PHMRC_adult$gs_text34))) - 1
csmf_true <- csmf_true / sum(csmf_true)
c(getCSMF_accuracy(csmf_inter, csmf_true, undet = "Undetermined"), 
  getCSMF_accuracy(csmf_ins[, "Mean"], csmf_true), 
  getCSMF_accuracy(csmf_nbc, csmf_true), 
  getCSMF_accuracy(csmf_tariff, csmf_true))## [1] 0.53 0.74 0.77 0.68We use the empirical distribution in the test data to calculate the true CSMF distribution, i.e., $CSMF_i^{(true)} = $. Then we evaluate the CSMF accuracy using the getCSMF_accuracy() function. As discussed previously, the default CSMF calculation is slightly different for diCfferent methods. For example, the InterVA algorithm creates the additional category of Undetermined by default, which is not in the true CSMF categories and needs to be specified. The creation of the undetermined category can also be suppressed by interVA.rule = FALSE in the getCSMF() function call. For the InSilicoVA algorithm, we use the posterior mean to calculate the point estimates of the CSMF accuracy.
At the individual level, we can extract the most likely cause-of-death assignment from the fitted object using the getTopCOD() function.
cod_inter <- getTopCOD(fit_inter)
cod_ins <- getTopCOD(fit_ins)
cod_nbc <- getTopCOD(fit_nbc)
cod_tariff <- getTopCOD(fit_tariff)With the most likely COD assignment, other types of metrics based on individual COD assignment accuracy can be similarly constructed by users. The summary methods can also be called for each death ID. For example, using the Tariff method, we can extract the fitted rankings of causes for the death with ID 6288 by
summary(fit_inter, id = "6288")## InterVA-4 fitted top 5 causes for death ID: 6288
## 
##  Cause                     Likelihood
##  Stroke                    0.509     
##  Pneumonia                 0.318     
##  COPD                      0.081     
##  Other Infectious Diseases 0.064     
##  Renal Failure             0.013The summary() function for InSilcoVA does not provide uncertainty estimates for individual COD assignments by default. This is because, in practice, the calculation of individual posterior probabilities of COD distribution can be memory-intensive when the dataset is large. To obtain individual-level uncertainty measurements, we can either run the MCMC chain with the additional argument indiv.CI = 0.95 when calling codeVA(), or update the fitted object directly with the saved posterior draws.
fit_ins <- updateIndiv(fit_ins, CI = 0.95)
summary(fit_ins, id = "6288")## InSilicoVA fitted top  causes for death ID: 6288
## Credible intervals shown: 95%
##                               Mean  Lower Median  Upper
## Stroke                      0.5043 0.3485 0.5083 0.6361
## Pneumonia                   0.4116 0.2615 0.4083 0.5834
## Other Infectious Diseases   0.0660 0.0411 0.0642 0.0966
## Epilepsy                    0.0099 0.0064 0.0097 0.0142
## COPD                        0.0053 0.0031 0.0052 0.0079
## Malaria                     0.0007 0.0005 0.0007 0.0011
## Diabetes                    0.0005 0.0003 0.0005 0.0009
## Acute Myocardial Infarction 0.0004 0.0003 0.0004 0.0006
## Falls                       0.0004 0.0001 0.0004 0.0013
## Renal Failure               0.0004 0.0002 0.0003 0.0005For \(N\) deaths, \(C\) causes, the posterior mean of individual COD distributions returned by the InSilicoVA model, along with median and with credible intervals can be represented by a \((N \times C \times 4)\)-dimensional array. The function getIndivProb() extracts this summary in the form of a list of \(4\) matrices of dimension \(N\) by \(C\), which can then be saved to other formats to facilitate further analysis. For other methods, the point estimates of individual COD distribution are returned as the \(N\) by \(C\) matrix.
fit_prob <- getIndivProb(fit_inter)
dim(fit_prob)## [1] 1554   34The previous sections discuss how results could be extracted and examined in R. In this subsection, we show some visualization tools provided in the openVA package for presenting these results. The fitted CSMFs for the top causes can be easily visualized by the plotVA() function. The default graph components are specific to each algorithm and individual package implementations, with options for further customization. For example, the figure below shows the estimated CSMF from the InterVA algorithm in the PHMRC data example.
plotVA(fit_inter, title = "InterVA")Top 10 CSMFs estimated by InterVA.
The CSMFs can also be aggregated for easier visualization of groups of causes. For the InterVA-4 cause list, we included an example grouping built into the package, so the aggregated CSMFs can be compared directly. In practice, the grouping of causes of deaths often needs to be determined according to context and the research question of interest. Changing the grouping can be easily achieved by modifying the grouping argument in the stackplotVA() function. For example, to facilitate the new category of Undetermined returned by InterVA, we first modify the grouping matrix to include it as a new cause and visualize the aggregated CSMF estimates in the figure below.
data(SampleCategory)
grouping <- SampleCategory
grouping[,1] <- as.character(grouping[,1])
grouping <- rbind(grouping, c("Undetermined", "Undetermined"))
compare <- list(InterVA4 = fit_inter_who,
                InSilicoVA = fit_ins_who)
stackplotVA(compare, xlab = "", angle = 0,  grouping = grouping)Estimated aggregated CSMFs for InterVA-4 and InSilicoVA, adding undetermined category.
The ordering of the stacked bars can also be changed to reflect the structures within the aggregated causes, as demonstrated in the figure below.
group_order <- c("TB/AIDS",  "Communicable", "NCD", "External", "Maternal",
            "causes specific to infancy", "Undetermined") 
stackplotVA(compare, xlab = "", angle = 0, grouping = grouping, 
            group_order = group_order)Estimated aggregated CSMFs for InterVA-4 and InSilicoVA, with the causes reordered.
Among the VA methods discussed in this paper, the InSilicoVA algorithm (McCormick et al. 2016) allows for more flexible modifications to the Bayesian hierarchical model structure when additional information is available. In this section, we illustrate two features unique to the InSilicoVA method: jointly estimating CSMFs from multiple populations, and incorporating partial and potentially noisy physician codings into the algorithm.
In practice researchers may want to estimate and compare CSMFs for different regions, time periods, or demographic groups in the population. Running separate models on subsets of data can be inefficient and does not allow parameter estimation to borrow information across different groups. The generative framework adopted by InSilicoVA allows the specification of sub-populations in analyzing VA data. Consider an input dataset with \(G\) different sub-populations. We can estimate different CSMFs \(\pi^{(g)}\) for \(g = 1, ..., G\) for each sub-population, while assuming the same conditional probability matrix, \(P_{s|c}\) and other hyperpriors. As an example, we show how to estimate different CSMFs for sub-populations specified by sex and age groups, using a randomly sampled ALPHA dataset with additional columns specifying the sub-population each death belongs to.
data(RandomVA2)
head(RandomVA2[, 244:248])##   stradm smobph scosts   sex age
## 1      .      .      .   Men 60+
## 2      .      .      . Women 60-
## 3      .      .      . Women 60-
## 4      .      .      . Women 60+
## 5      .      .      . Women 60-
## 6      .      .      . Women 60-Then we can fit the model with one or multiple additional columns specifying sub-population membership for each observation.
fit_sub <- codeVA(RandomVA2, model = "InSilicoVA", 
              subpop = list("sex", "age"),  indiv.CI = 0.95,
              Nsim = 10000, auto.length = FALSE)Functions discussed in the previous sections work in the same way for the fitted object with multiple sub-populations. Additional visualization tools are also available. The figure below plots the CSMFs for two sub-populations on the same plot by specify type = "compare".
plotVA(fit_sub, type = "compare", title = "Comparing CSMFs", top = 3)Estimated CSMFs for different sub-populations. The points indicate posterior means of the CSMF and the error bars indicate 95% credible intervals of the CSMF.
By default, the comparison plots will select all the CODs that are included in the top \(K\) for each of the sub-populations. We can also plot only subsets of them by specifying the causes of interest, as shown in the figure below.
plotVA(fit_sub, type = "compare", title = "Comparing CSMFs",
                causelist = c("HIV/AIDS related death", 
                              "Pulmonary tuberculosis", 
                              "Other and unspecified infect dis", 
                              "Other and unspecified NCD"))Estimated fraction of deaths due to selected CODs for different sub-populations. The points indicate posterior means of the CSMF and the error bars indicate 95% credible intervals of the CSMF.
The figure below shows the visualization of the top CSMFs for a chosen sub-population using the which.sub argument.
plotVA(fit_sub, which.sub = "Women 60-", title = "Women 60-")Top 10 CSMFs for a specified sub-population (women under \(60\) years old).
Similar to before, the stackplot() function can also be used to compare different sub-populations in aggregated cause groups, as shown in the figure below.
stackplotVA(fit_sub)
stackplotVA(fit_sub, type = "dodge")Aggregated CSMFs for four different sub-populations. Left: comparison using the stacked bar chart. Right: comparison using the bar chart arranged side to side. The height of the bars indicate posterior means of the CSMF and the error bars indicate 95% credible intervals of the CSMF.
When physician-coded causes of death are available for all or a subset of the deaths, we can incorporate such information in the InSilicoVA model. The physician-coded causes can be either the same as the CODs used for the final algorithm, or consist of causes at a higher level of aggregation. When there is more than one physician code for each death and the physician identity is known, we can first de-bias the multiple codes provided from different physicians using the process described in McCormick et al. (2016). For the purpose of implementation, we only need to specify which columns are physician IDs, and which are their coded causes, respectively.
data(SampleCategory)
data(RandomPhysician)
head(RandomPhysician[, 245:250])##   smobph scosts        code1 rev1        code2 rev2
## 1      .      .          NCD doc9          NCD doc6
## 2      .      .          NCD doc4          NCD doc3
## 3      .      .          NCD doc1          NCD doc5
## 4      .      .      TB/AIDS doc4      TB/AIDS doc7
## 5      .      .      TB/AIDS doc5      TB/AIDS doc9
## 6      .      . Communicable doc9 Communicable <NA>doctors <- paste0("doc", c(1:15))
causelist <- c("Communicable", "TB/AIDS", "Maternal",
              "NCD", "External", "Unknown")
phydebias <- physician_debias(RandomPhysician, 
          phy.id = c("rev1", "rev2"), phy.code = c("code1", "code2"), 
          phylist = doctors, causelist = causelist, tol = 0.0001, max.itr = 100)The de-biased step essentially creates a prior probability distribution for each death over the broader categories of causes. Then to run InSilicoVA with the de-biased physician coding, we can simply pass the fitted object from the previous step to the model. Additional arguments are needed to specify both the external cause category, since external causes are handled by separate heuristics, and the unknown category, which is equivalent to a uniform probability distribution over all other categories, i.e., the same as the case where no physician coding exists.
fit_ins_phy <- codeVA(RandomVA1, model = "InSilicoVA",
             phy.debias = phydebias, phy.cat = SampleCategory, 
             phy.external = "External", phy.unknown = "Unknown",
             Nsim = 10000, auto.length = FALSE) The figure below compares the previous results without including physicians codes.
plotVA(fit_ins_who, title = "Without physician coding")
plotVA(fit_ins_phy, title = "With physician coding")Comparing fitted CSMF with and without physicians
The originally proposed InSilicoVA assumes all causes of death are possible for each observation. The impact from such an assumption is mild when data are abundant, but could be problematic when either the sample size is small or the proportion of missing data is high. In both cases, physically impossible causes might get assigned with non-ignorable posterior mass. Since version 1.1.5 of InSilicoVA, when the input is in the WHO format, the algorithm automatically checks and removes impossible causes before fitting the model. The \(k\)-th cause is defined as physically impossible for the \(i\)-th death if \(P(s_{ij}=1 | y_{i}=k) = 0\) where \(s_{ij}\) is an indicator that the decedent belongs to a particular sex or age group. We then consider a cause to be physically impossible for the underlying population if it is impossible for all the observations in the input data. For example, with the new implementation, CSMF for pregnancy-related causes will not be estimated if the input data consist of only male deaths.
This work was supported by grants K01HD078452, R01HD086227, and R21HD095451 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD). Funding was also received from the Data for Health Initiative, a joint project of Vital Strategies, the CDC Foundation and Bloomberg Philanthropies, through Vital Strategies. The views expressed are not necessarily those of the initiative.