In this vignette, we demonstrate the NetGSA workflow using a subset of the breast cancer data example downloaded from the Cancer Genome Atlas (TCGA 2012). In particular, we illustrate how to incorporate existing network information (e.g. curated edges from KEGG) to improve the power of pathway enrichment analysis with NetGSA. Details of the method are avaialble in Ma, Shojaie, and Michailidis (2016).
netgsa provides simple functions that automatically
search known gene databases using graphite for network
information and integrate seamlessly with NetGSA pathway
enrichment andigraph and Cytoscape plotting. In this
vignette, we demonstrate the NetGSA workflow explaining proper data
types and usage of the netgsa package.
Our example data set comes from a breast cancer study (TCGA 2012), which consists of gene expression data from 520 subjects including 117 estrogen-receptor-negative (ER-) and 403 estrogen-receptor-positive (ER+). The goal is to generate diagnostic pathway signatures that could distinguish patients with different ER statuses by comparing gene expression data from the two classes. These signatures could provide additional clinical benefit in diagnosing breast cancer.
NetGSA works directly with the expression data matrix. When loading the data, it is important to check the distribution of raw sequencing reads and perform log transformation if necessary. Data in this example were already log transformed. Rows of the data matrix correspond to genes and columns to subjects. Genes in this data matrix were labeled with Entrez IDs, same as those used in KEGG pathways.
## [1] "edgelist"     "group"        "nonedgelist"  "pathways"     "pathways_mat"
## [6] "x"The objects loaded which are necessary for this vignette are:
x, the data matrixedgelist, a data.frame containing user-specified
edgesnonedgelist, a data.frame containing user-specified
non-edgesgroup, a vector mapping columns of x to
conditionspathways, a list of KEGG pathways,pathways_mat, an indicator matrix of KEGG pathways to
be used directly in NetGSAxThe data matrix, x must have rownames that are named as
"GENE_ID:GENE_VALUE". Since netgsa allows the
user to search for edges in multiple databases in graphite,
each of which may use different identifiers (e.g. KEGG, Reactome,
Biocarta, etc.), netgsa needs to know the identifier for a
given gene so we can convert it properly. For example, if you have two
genes “ENTREZID: 7186” and “ENTREZID: 329” and want to search for
potential edges between these two within Reactome, netgsa
must first convert these genes to their “UNIPROT” IDs and use those to
search for edges.
netgsa uses the AnnotationDbi package to
convert genes. The valid list of "GENE_ID"’s is:
## ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"An example of what the rownames for the data matrix, x,
should look like is:
## [1] "ENTREZID:127550" "ENTREZID:53947"  "ENTREZID:65985"  "ENTREZID:51166" 
## [5] "ENTREZID:15"     "ENTREZID:60496"The columns of the data matrix do not need to be named, but it is useful for keeping track of groups.
The group object is a vector (numeric or character) and
must be the same length as ncol(x). Each element of the
group tells us which group each column of x
corresponds to. For example, if group[1] = "Positive" this
says that the first column of x corresponds to the
"Positive" condition.
We can find out the ER status by looking at the group labels.
## group
##   1   2 
## 403 117The edgelist and non-edgelist are strings representing file locations
and are read in using data.table’s fread()
command. These are where users can specify known edges/non-edges of
their own. Each observation is assumed to be a directed edge (for
edgelist) or a directed non-edge (for non-edgelist). They both must have
4 columns in the following order. The columns do not necessarily need to
be named properly, they simply must be in this specific order:
Note it is assumed that each edge/non-edge is directed so if you want an undirected edge/non-edge you should put in two observations as in:
##    base_gene_src base_id_src base_gene_dest base_id_dest
##           <char>      <char>         <char>       <char>
## 1:          7534    ENTREZID           8607     ENTREZID
## 2:          8607    ENTREZID           7534     ENTREZIDLastly, as documented in ?NetGSA we need an indicator
matrix of pathways. The rows of this matrix should correspond to
pathways and the columns to the genes included in the analysis. The
dimension is then npath x p. Values in this matrix
should be 1 for genes that are in a given pathway and 0 otherwise.
We consider pathways from KEGG (Kanehisa and Goto 2000). KEGG pathways can be accessed in R using the graphite package.
## "Glycolysis / Gluconeogenesis" pathway
## Native ID       = hsa:00010
## Database        = KEGG
## Species         = hsapiens
## Number of nodes = 93
## Number of edges = 1128
## Retrieved on    = 22-10-2024
## URL             = http://www.kegg.jp/kegg-bin/show_pathway?org_name=hsa&mapno=00010## [1] "ENTREZID:10327" "ENTREZID:124"   "ENTREZID:125"   "ENTREZID:126"  
## [5] "ENTREZID:127"   "ENTREZID:128"For this vignette, we’ll use pathways_mat:
##                                                      ENTREZID:18
## Adipocytokine signaling pathway                                0
## Adrenergic signaling in cardiomyocytes                         0
## AGE-RAGE signaling pathway in diabetic complications           0
## Alanine, aspartate and glutamate metabolism                    1
## AMPK signaling pathway                                         0The 1 shown indicates that "ENTREZID: 18" belongs to the
Alanine, aspartate and glutamate metabolism pathway.
In summary, the data that we’ll need to run our pathway enrichment analysis is:
x), with rows for genes and columns
for samples,group) for ER status,edgelist) to be read in with fread(),nonedgelist) to be read in with fread(),pathways_mat).Now that we have our data set-up properly we can begin with the
pathway enrichment analysis. netgsa has three main
functions. The first is prepareAdjMat which searches for
edges in public databases and uses this information to estimate the
adjacency matrices needed for NetGSA. The second is
NetGSA which performs network-based gene set analysis. We
will go into further details about the usage of these functions
below.
After having the data set-up, the first step in pathway enrichment
analysis with netgsa is to estimate the adjacency matrices.
Remember, the rownames of the data matrix X must be named
as "GENE_ID:GENE_VALUE" as in "ENTREZID:7534".
The group vector is defined as above.
The databases argument can be either (1) the result of
obtainEdgeList or (2) a character vector defining the
databases to search. These are essentially the same thing, the only
difference is that for the character vector method,
obtainEdgeList is called inside prepareAdjMat
and cannot be saved. Since prepareAdjMat queries the
graphite databases when it is called and
graphite databases can change overtime, this may not be
desirable for reproducibility. Using the obtainEdgeList
method, one can save the edgelist to ensure the same network information
is used across iterations or in the future. In both methods, one must
specify the databases to search. The options are the databases for homo
spaiens available in graphite or NDEx (only for development
version on Github). The graphite databases are:
## [1] "kegg"         "panther"      "pathbank"     "pharmgkb"     "reactome"    
## [6] "smpdb"        "wikipathways"Note the databases argument is case sensitive so make
sure to pass "reactome" and not
"Reactome".
The cluster argument controls whether or not clustering
is used when estimating the adjacency matrix. The default behavior is to
use clustering if p > 2,500. However, the user can
override this behavior by setting the cluster argument.
prepareAdjMat chooses the best clustering method from 6
possible methods in the igraph package. More details are
provided in ?prepareAdjMat.
User specified edge and non-edge files are specified with the
file_e and file_ne arguments respectively. For
more information on the assumptions and how network information is
incorporated from the database edgelists, the user edges, and the user
non-edges see the Details section and file_e and
file_ne parameters in ?prepareAdjMat.
It is also important to note that prepareAdjMat will
automatically choose the correct network estimation technique based on
whether or not the graph is directed so no additional work is needed to
determine undirected vs directed graphs. This is documented in
?prepareAdjMat.
Now let’s get to some example code. Suppose we wanted to estimate the network for our example data using our known edges/non-edges and searching for edges in Reactome, KEGG, and BioCarta. Let’s also use clustering to speed up computation. We could do this simply by:
database_search <- obtainEdgeList(rownames(x), c("kegg", "reactome"))
network_info <- prepareAdjMat(x, group, database_search,
                                         cluster = TRUE, file_e = "edgelist.txt", 
                                         file_ne = "nonedgelist.txt")The main value of interest is network_info[["Adj"]], a
list of lists. The top level list indicates the condition while the next
level is a list of adjacency matrices (one for each cluster). If
cluster = FALSE there is an element for each connected
component in the network. Also note that in this example, the object
database_search was created. If desired, this can be saved
and used later to ensure the same network information was used. However,
the results may not be exactly the same. For example, some of the
clustering algorithms are not deterministic so while the network
information may be the same the clusters may be different resulting in
different adjacency matrices. The adjacency matrix for the first
condition and first cluster can be accessed with:
##             ENTREZID:18 ENTREZID:27 ENTREZID:28
## ENTREZID:18           0           0           0
## ENTREZID:27           0           0           0
## ENTREZID:28           0           0           0The total number of clusters can be identified with:
## [1] 16Note we cannot control cluster size which is why we try several
methods and implement rules on which to choose. For more information see
?prepareAdjMat.
Now that the adjacency matrices are assembled we can perform pathway
inference using NetGSA. The returned value from
prepareAdjMat will always be in the correct format
regardless of whether or not clustering is used, so one should always be
able to pass network_info[["Adj"]] directly to
NetGSA. The default is to use restricted Haseman-Elston
regression (REHE) with sampling to estimate the variance parameters.
This allows for significant computational speed up and little to no loss
in power. One can also use REML for variance parameter estimation but
this can be quite slow for problems with moderate or large dimension.
For explanatory purposes, we explicitly write out the function:
pathway_tests_rehe <- NetGSA(network_info[["Adj"]], x, group, pathways_mat, 
                             lklMethod = "REHE", sampling = TRUE, 
                             sample_n = 0.25, sample_p = 0.25)The sample_n and sample_p parameters
control the ratio for subsampling observations (columns of data matrix
x) and variables (rows of data matrix x)
respectively. More information on REHE and sampling can be found in the
?NetGSA Details section.
Note that inference using REML can take several hours depending on
the complexity of the problem, while inference using REHE can take
minutes. Roughly, REML becomes quite slow for p > 2,000. In this
example, with sample_n = 0.25, sample_p = 0.25
only took about 2 minutes on a 2 CPU personal computer with 16 GB of
RAM.
The main inference results are stored in
pathway_tests[["results"]] and contain the pathway names as
well as their significance (p-values and q-values are reported) and test
statistics.
netgsa provides several options for visualizing the
results from NetGSA. Cytoscape or igraph is used to display
the main results depending on whether the user has the Cytoscape app
open.
If the user has Cytoscape open on their computer, calling
plot.NetGSA will create several plots:
Cytoscape plots - the first place
plot.NetGSA generates plots is within Cytoscape. A nested
network is created where the main network (Pathway Network) displays
pathways as nodes. Edges in this network represent edges between genes
contained within those pathways. In this network, the
results object from NetGSA() is loaded as the
node data and the number of edges between genes of separate pathways are
loaded as edge data in a variable called “weight”. Node color is mapped
to the test statistic. Negative values are orange, values around 0 are
white, and positive values are blue. Node border color is mapped to
q-values going from red (0) to white (1).
In addition to the main network, a network is generated for each pathway. These are the gene networks underlying each pathway and are referred to as the within pathway networks. The within pathway networks are linked to the nodes in the Pathway Network using Cytoscape’s nested network format. An example will make this more clear. Calling:
The Pathway Network might look something like:
The most interesting part of the Cytoscape nested network framework
is the ability to view the within pathway networks. Suppose you were
looking at the Pathway Network and noticed the ErbB signaling
network was highly significant and you wanted to investigate it by
looking at the underlying gene network. To do this, one could simply
right click on the ErbB signaling pathway node in the Pathway
Network, and go to Nested Networks -> Go To Nested Network.
Alternatively, the nested networks have the same name as the pathway
they represent so one could simply navigate to the ErbB signaling
pathway using the Network tab in the Control
Panel on the left side of the Cytoscape GUI. To save time, the
nested networks are not formatted. However, we can use the
formatPathways function to format one or multiple using the
default style.
The formatPathways function colors gene nodes based on
FDR adjusted q-value. For more information see
?formatPathways.
To export images from Cytoscape one can either use the GUI or the
RCy3::exportImage() function.
Cytoscape legend generated in R - Legends are difficult to incorporate on the plot within Cytoscape so we generate a legend in R. Alternatively it is also possible to generate an image directly in Cytoscape. The Cytoscape manual page has more information on that: http://manual.cytoscape.org/en/stable/Styles.html.
Igraph plot - Lastly, for users that prefer R,
plot.NetGSA will also generate a plot using
igraph with the same layout and node color/node border
color mapping as in Cytoscape.
The Cytoscape plotting section is wrapped up with a few notes on
customization. While plot.NetGSA has a default layout, a
custom layout can be provided to both plot.NetGSA and
formatPathways using the graph_layout
parameter. For Cytoscape, this parameter is passed as a string to
RCy3::layoutNetwork so it may be helpful to read that
documentation. Alternatively, one does not even need to use this
parameter. They can set up a custom layout using the Cytoscape GUI or
can even use the RCy3::layoutNetwork function directly. For
example, calling layoutNetwork("degree-circle") will change
the layout of the current network selected in Cytoscape.
# Format the "Neurotrophin signaling pathway" using the "degree-circle" layout
formatPathways(pathway_tests_rehe, "Neurotrophin signaling pathway",
             graph_layout = "degree-circle")Lastly, for users who want even more customization, the
results object returned from NetGSA() and the
edge weight between pathways are loaded into Cytoscape. So to size edges
based on edge weight one could:
If the user does not have Cytoscape open, a 3-D igraph plot is
created using igraph::rglplot. Due to the nature of the
igraph::rglplot function we only map test statistics to
node color. We are not able to map colors to both the node and node
border.
Again, a custom layout can be specified with the
graph_layout parameter. As documented in
?plot.NetGSA this must be a function that takes one
parameter which is passed to the igraph::rglplot function.
For example to layout a graph randomly we can do:
layout_fun <- function(ig) igraph::layout_randomly(ig)
plot(pathway_tests_rehe, graph_layout = layout_fun)While igraph does not provide a similar nested network format as
Cytoscape, we can use the zoomPathway function to look at
the gene interactions within a pathway: