r3Cseq 1.29.0
The coupling of chromosome conformation capture (3C)-based methods with next-generation sequencing (NGS) enables high-throughput detection of long-range genomic interactions via the generation of novel ligation products between DNA sequences that are closely juxtaposed in vivo. These interactions may involve promoter regions, enhancers and other regulatory and structural elements of chromosomes, and can reveal key details of the regulation of gene expression. 3C-seq is a variant of the method that detects interactions between one chosen genomic element (the viewpoint) and the rest of the genome. We present an R/Bioconductor package designed to perform 3C-seq data analysis for a number of different experimental designs, with or without a control experiment. The package can also be used to analyze experiments with replicates. It provides functions for 3C-seq data normalization, statistical analysis of cis/trans interactions, and visualization, to facilitate the identification of genomic regions that physically interact with the given viewpoints of interest. The r3Cseq package greatly facilitates hypothesis generation and the interpretation of experimental results.
This vignette describes how to use the r3Cseq package. r3Cseq (Thongjuea et al. 2013) is a Bioconductor-compliant R package designed to facilitate the identification of interaction regions generated by chromosome conformation capture combined with next-generation sequencing (3C-seq). The fundamental principles of 3C-seq are briefly as follows (Soler et al. 2010). Isolated cells are treated with a cross-linking agent to preserve in vivo nuclear proximity between DNA sequences. The DNA isolated from these cells is then digested using a primary restriction enzyme, typically a 6-base-pair cutter such as HindIII, EcoRI or BamHI. The digested product is then ligated under dilute conditions to favor intra-molecular over inter-molecular ligation events. This digested and ligated chromatin yields composite sequences representing (distal) genomic regions that are in close physical proximity in the cell nucleus. The digested and ligated chromatin is then de-crosslinked and subjected to a second restriction digest using either NlaIII or DpnII (4-base-pair cutters) as a secondary restriction enzyme to decrease the fragment sizes. The digested DNA is then ligated again under dilute conditions, creating small circular fragments. These fragments are amplified by inverse PCR using primers specific for a genomic region of interest (e.g. a promoter, enhancer, or any other element potentially involved in long-range interactions), termed the “viewpoint”. The amplified fragments are then sequenced using massively parallel high-throughput sequencing. The 3C-seq procedure thus produces hybrid DNA molecules that combine the viewpoint-specific primer sequences with sequences derived from the ligated interaction fragments. As such, these composite sequences are unmappable and need to be trimmed to remove the viewpoint sequences, leaving only the captured sequence fragments for mapping.
After trimming, reads are mapped against a reference genome using alignment software such as Bowtie. The mapped read file generated by the mapping software is then converted to a BAM file and analyzed using the r3Cseq package.
The required input is the BAM file obtained as output from the mapping software. The BAM file name must not contain special symbols such as “*”, “%”, “$”, “#”, “!”, “@”, or “-”. Any “-” or “.” in the name (apart from the “.bam” extension) must be replaced with “_”. The file name should be short and simple. For example, the original file name “%RIKO-R1-T5_S15_L001.trimmed_experiment.sorted.bam” must be changed to “RIKO_R1_T5_S15_L001.bam”.
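As a rough illustration of the renaming rule above, the following base R sketch replaces “-” and “.” with “_” and drops any remaining special symbols. The helper name sanitize_bam_name is hypothetical and is not part of r3Cseq; shortening the name further, as in the example above, is left to the user.

```r
# Hypothetical helper (not part of r3Cseq): make a BAM file name safe to use.
sanitize_bam_name <- function(path) {
  stem <- sub("\\.bam$", "", basename(path))   # drop the .bam extension
  stem <- gsub("[-.]", "_", stem)              # "-" and "." become "_"
  stem <- gsub("[^A-Za-z0-9_]", "", stem)      # remove other special symbols
  paste0(stem, ".bam")                         # put the extension back
}

old_name <- "%RIKO-R1-T5_S15_L001.trimmed_experiment.sorted.bam"
new_name <- sanitize_bam_name(old_name)
print(new_name)

# To rename the file on disk one could then call:
# file.rename(old_name, new_name)
```

The sketch keeps every part of the original name, so the result is longer than the hand-shortened name given in the text.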
The chromosome identifiers used in each input BAM file must match the naming that r3Cseq expects. Each chromosome must be named in the “chr[1..19XYM]” format for the mouse reference genome and in the “chr[1..22XYM]” format for the human reference genome. Before using the package, check the identifiers in the mapped file. If they are not in the proper format, for example ‘mm9_ref_chr01.fa’, a Unix command such as ‘sed’ can be used to replace ‘mm9_ref_chr01.fa’ with ‘chr1’.
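The same string transformation can be sketched in R rather than sed. The example below assumes identifiers of the form ‘mm9_ref_chr01.fa’, as in the example above; the prefix, the suffix, and the helper name normalize_chr_name are assumptions to adapt to the actual mapped file.

```r
# Hypothetical helper: convert identifiers such as "mm9_ref_chr01.fa" to "chr1".
normalize_chr_name <- function(x) {
  x <- sub("^mm9_ref_", "", x)   # strip the assumed prefix
  x <- sub("\\.fa$", "", x)      # strip the assumed suffix
  sub("^chr0+", "chr", x)        # drop zero padding: chr01 -> chr1
}

ids <- c("mm9_ref_chr01.fa", "mm9_ref_chr10.fa",
         "mm9_ref_chrX.fa", "mm9_ref_chrM.fa")
fixed <- normalize_chr_name(ids)
print(fixed)

# Check the result against the mouse naming scheme chr1..chr19, chrX, chrY, chrM
allowed <- paste0("chr", c(1:19, "X", "Y", "M"))
stopifnot(all(fixed %in% allowed))
```

This only shows the renaming of the identifiers themselves; applying it to the mapped file is a separate step.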
3C-seq data generated by Stadhouders et al. (2012) are used for the demonstration. The current version of r3Cseq supports the mouse, human, and rat genomes. The package therefore requires one of the following BSgenome packages to be installed: BSgenome.Mmusculus.UCSC.mm9.masked, BSgenome.Mmusculus.UCSC.mm10.masked, BSgenome.Hsapiens.UCSC.hg18.masked, BSgenome.Hsapiens.UCSC.hg19.masked, or BSgenome.Rnorvegicus.UCSC.rn5.masked.
Load the r3Cseq package into R.
library(r3Cseq)
The package includes two example data sets.
data(Myb_prom_FL)
data(Myb_prom_FB)
Myb_prom_FL contains the aligned 3C-seq reads of the Myb promoter interaction signal in fetal liver. The reads are stored in a ‘GRanges’ object processed with the ‘Rsamtools’ package.
Myb_prom_FB contains the aligned 3C-seq reads of the Myb promoter interaction signal in fetal brain.
We next use r3Cseq to identify regions that potentially interact with the promoter region of the Myb gene in both fetal liver and fetal brain (Stadhouders et al. 2012).
In this section, we analyze 3C-seq data derived from fetal liver (which expresses the Myb gene at a high level) and fetal brain (which expresses the Myb gene at a low level). The latter is used as a negative control. More examples of r3Cseq data analysis can be found on the r3Cseq website, http://r3cseq.genereg.net. We first initialize the r3Cseq object.
my3Cseq.obj<-new("r3Cseq",organismName='mm9',isControlInvolved=TRUE,
viewpoint_chromosome='chr10',viewpoint_primer_forward='TCTTTGTTTGATGGCATCTGTT',
viewpoint_primer_reverse='AAAGGGGAGGAGAAGGAGGT',expLabel="Myb_prom_FL",
contrLabel="MYb_prom_FB",restrictionEnzyme='HindIII')
The input parameters are described in the r3Cseq help page. We next add the raw reads from “Myb_prom_FL” and “Myb_prom_FB”, which the code below accesses as exp.GRanges and contr.GRanges, to the existing my3Cseq.obj.
expRawData(my3Cseq.obj)<-exp.GRanges
contrRawData(my3Cseq.obj)<-contr.GRanges
my3Cseq.obj
## An object of class "r3Cseq"
## Slot "alignedReadsBamExpFile":
## [1] ""
##
## Slot "alignedReadsBamContrFile":
## [1] ""
##
## Slot "expLabel":
## [1] "Myb_prom_FL"
##
## Slot "contrLabel":
## [1] "MYb_prom_FB"
##
## Slot "expLibrarySize":
## integer(0)
##
## Slot "contrLibrarySize":
## integer(0)
##
## Slot "expReadLength":
## integer(0)
##
## Slot "contrReadLength":
## integer(0)
##
## Slot "expRawData":
## GRanges object with 2478476 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 3005887-3005930 +
## [2] chr1 3005887-3005930 +
## [3] chr1 3005887-3005930 +
## [4] chr1 3005887-3005930 +
## [5] chr1 3005887-3005930 +
## ... ... ... ...
## [2478472] chrM 12341-12384 +
## [2478473] chrM 12341-12384 +
## [2478474] chrM 12341-12384 +
## [2478475] chrM 12341-12384 +
## [2478476] chrM 12341-12384 +
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## Slot "contrRawData":
## GRanges object with 2838018 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 3046711-3046754 -
## [2] chr1 3046711-3046754 -
## [3] chr1 3046711-3046754 -
## [4] chr1 3046711-3046754 -
## [5] chr1 3402788-3402831 +
## ... ... ... ...
## [2838014] chrM 11987-12030 +
## [2838015] chrM 11987-12030 +
## [2838016] chrM 11987-12030 +
## [2838017] chrM 11987-12030 +
## [2838018] chrM 11990-12033 +
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## Slot "organismName":
## [1] "mm9"
##
## Slot "restrictionEnzyme":
## [1] "HindIII"
##
## Slot "viewpoint_chromosome":
## [1] "chr10"
##
## Slot "viewpoint_primer_forward":
## [1] "TCTTTGTTTGATGGCATCTGTT"
##
## Slot "viewpoint_primer_reverse":
## [1] "AAAGGGGAGGAGAAGGAGGT"
##
## Slot "expReadCount":
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: no sequences
##
## Slot "contrReadCount":
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: no sequences
##
## Slot "expRPM":
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: no sequences
##
## Slot "contrRPM":
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: no sequences
##
## Slot "expInteractionRegions":
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: no sequences
##
## Slot "contrInteractionRegions":
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: no sequences
##
## Slot "isControlInvolved":
## [1] TRUE
To count the number of reads per restriction fragment, call the getReadCountPerRestrictionFragment function.
getReadCountPerRestrictionFragment(my3Cseq.obj)
## [1] "Fragmenting genome by....HindIII...."
## [1] "Count processing in the experiment is done."
## [1] "Count processing in the control is done."
The package also provides the function getReadCountPerWindow, which counts the number of reads per non-overlapping window of a user-defined size.
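The idea behind window-based counting can be sketched in base R as follows. This is only an illustration on made-up read start positions for a single chromosome, not the implementation used by getReadCountPerWindow; the function name count_reads_per_window and the window size are assumptions.

```r
# Illustrative only: count reads in non-overlapping windows on one chromosome.
count_reads_per_window <- function(starts, window_size) {
  # 1-based positions 1..window_size fall in window 0, the next block in 1, ...
  idx <- (starts - 1) %/% window_size
  counts <- table(idx)
  win <- as.integer(names(counts))
  data.frame(
    start  = win * window_size + 1,
    end    = (win + 1) * window_size,
    nReads = as.integer(counts)
  )
}

# Made-up read start positions (not taken from the example data)
read_starts <- c(120, 1800, 1999, 2000, 2001, 5500, 5999)
windows <- count_reads_per_window(read_starts, window_size = 2000)
print(windows)
```

Only windows that contain at least one read are reported in this sketch; empty windows are omitted.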
We next perform data normalization.
calculateRPM(my3Cseq.obj)
## [1] "Normal RPM calculation is done."
After normalization, call the getInteractions function.
getInteractions(my3Cseq.obj,fdr=0.05)
## [1] "Calculation is done. Use function 'expInteractionRegions' or 'contrInteractionRegions' to get the result."
Two functions, expInteractionRegions and contrInteractionRegions, access the interaction regions stored in the slots of the r3Cseq object. To get the interaction regions for the experiment, use expInteractionRegions.
fetal.liver.interactions<-expInteractionRegions(my3Cseq.obj)
fetal.liver.interactions
## GRanges object with 19502 ranges and 5 metadata columns:
## seqnames ranges strand | nReads RPMs
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr10 3009255-3014345 * | 3 0.0605737657161655
## [2] chr10 3067271-3068193 * | 122 11.8504107319424
## [3] chr10 3127452-3132355 * | 101 9.05557622062603
## [4] chr10 3132360-3134033 * | 70 5.37270060859088
## [5] chr10 3141247-3145770 * | 113 10.6253420244925
## ... ... ... ... . ... ...
## [19498] chrY 1229367-1231265 * | 2 0.0340049639221147
## [19499] chrY 1279671-1287427 * | 2 0.0340049639221147
## [19500] chrY 1293808-1295589 * | 137 13.9779584889841
## [19501] chrY 1686797-1687184 * | 2 0.0340049639221147
## [19502] chrY 2641919-2642337 * | 1 0.0126734717809562
## z p.value q.value
## <numeric> <numeric> <numeric>
## [1] -0.173110045707998 1 1
## [2] 0.106335303551167 0.915316322182506 1
## [3] 0.0407624174864911 0.967485300942551 1
## [4] -0.0368364336893386 1 1
## [5] 0.0673961352730922 0.946266345800746 1
## ... ... ... ...
## [19498] -0.480154868476184 1 1
## [19499] -0.480154868476184 1 1
## [19500] 0.247537676244485 0.804492137104243 1
## [19501] -0.480154868476184 1 1
## [19502] -0.485545183622263 1 1
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
To get the interaction regions for the control, use contrInteractionRegions.
fetal.brain.interactions<-contrInteractionRegions(my3Cseq.obj)
fetal.brain.interactions
## GRanges object with 9684 ranges and 5 metadata columns:
## seqnames ranges strand | nReads RPMs
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr10 3019822-3022796 * | 425 44.8518378170537
## [2] chr10 3063618-3067025 * | 525 59.1090414506868
## [3] chr10 3087994-3096004 * | 309 29.5772330684001
## [4] chr10 3138895-3141226 * | 750 94.1872849616863
## [5] chr10 3148290-3157485 * | 4 0.101138955526621
## ... ... ... ... . ... ...
## [9680] chrY 1686797-1687184 * | 1 0.01653813879084
## [9681] chrY 1690876-1691262 * | 1 0.01653813879084
## [9682] chrY 2371312-2391988 * | 1 0.01653813879084
## [9683] chrY 2884699-2887174 * | 1 0.01653813879084
## [9684] chrY 2890136-2899310 * | 1 0.01653813879084
## z p.value q.value
## <numeric> <numeric> <numeric>
## [1] 0.22836348764507 0.81936367216419 1
## [2] 0.381850355068183 0.702572365787877 1
## [3] 0.0709858197887684 0.943409041188086 1
## [4] 0.72291877311053 0.469729789135378 1
## [5] -0.363637905503913 1 1
## ... ... ... ...
## [9680] -0.474103528881472 1 1
## [9681] -0.474103528881472 1 1
## [9682] -0.474103528881472 1 1
## [9683] -0.474103528881472 1 1
## [9684] -0.474103528881472 1 1
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
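The z-scores and p-values printed above appear to be related in a simple way: the p-values shown are consistent with twice the upper normal tail probability, capped at 1, so that every negative z-score gets a p-value of 1. The sketch below checks this against the first rows of the fetal liver table. It is an observation about the printed numbers, not a statement of how getInteractions computes them internally.

```r
# z-scores copied from the first five rows of the fetal liver output above
z <- c(-0.173110045707998, 0.106335303551167, 0.0407624174864911,
       -0.0368364336893386, 0.0673961352730922)

# Assumed relation: twice the upper normal tail, capped at 1
p <- pmin(1, 2 * pnorm(-z))
print(signif(p, 6))

# p-values printed in the same rows of the output
p_printed <- c(1, 0.915316322182506, 0.967485300942551, 1, 0.946266345800746)

# Largest absolute difference between recomputed and printed values
print(max(abs(p - p_printed)))
```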
To see the viewpoint information, use the getViewpoint function, which returns the viewpoint as a GRanges object.
viewpoint<-getViewpoint(my3Cseq.obj)
viewpoint
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr10 20880662-20880775 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The r3Cseq package provides the following visualization functions: plotOverviewInteractions, plotInteractionsNearViewpoint, plotInteractionsPerChromosome, and plotDomainogramNearViewpoint.
The plotOverviewInteractions function shows an overview of the interaction regions distributed across the genome.
plotOverviewInteractions(my3Cseq.obj)
The plotInteractionsNearViewpoint function zooms in on the interaction regions located close to the viewpoint.
plotInteractionsNearViewpoint(my3Cseq.obj)
The plotInteractionsPerChromosome function shows the interaction regions found on a given chromosome, here chromosome 10.
plotInteractionsPerChromosome(my3Cseq.obj,"chr10")
The plotDomainogramNearViewpoint function shows the domainogram of interactions found in cis. This function may take several minutes to produce the domainogram plot.
plotDomainogramNearViewpoint(my3Cseq.obj,view="both")
## [1] "Analyzing interaction regions for each window...."
## [1] "Analyzing interaction regions for each window...."
The getExpInteractionsInRefseq and getContrInteractionsInRefseq functions list the genes that have significant interaction signals in their proximity.
detected_genes<-getExpInteractionsInRefseq(my3Cseq.obj)
head(detected_genes)
## chromosome gene_name total_nReads total_RPMs
## 170 chr10 Myb 43217 20539.955
## 142 chr10 Hbs1l 34310 16031.869
## 28 chr10 Ahi1 29738 11912.850
## 134 chr4 Gm5506 15792 3939.331
## 32 chr10 Aldh8a1 9422 3633.383
## 136 chr4 Gm5801 13482 3350.087
The export3Cseq2bedGraph function exports all interactions from the GRanges object to bedGraph format, which can be uploaded to the UCSC genome browser.
export3Cseq2bedGraph(my3Cseq.obj)
## [1] "File Myb_prom_FL.RPMs.bedGraph.gz ' is created."
## [1] "File MYb_prom_FB.RPMs.bedGraph.gz ' is created."
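For readers who want to build a similar track by hand, the sketch below writes a few regions to a bedGraph file in base R. The data frame holds three regions and RPM values copied from the fetal liver output shown earlier; the file name, the track line, and the rounding are assumptions, and the file written by export3Cseq2bedGraph may be laid out differently. bedGraph uses 0-based, half-open coordinates, so 1 is subtracted from the 1-based start positions.

```r
# Three regions copied from the fetal liver interaction output
regions <- data.frame(
  chrom = c("chr10", "chr10", "chr10"),
  start = c(3067271, 3127452, 3132360),
  end   = c(3068193, 3132355, 3134033),
  RPMs  = c(11.8504107319424, 9.05557622062603, 5.37270060859088)
)

# Convert to bedGraph columns: chrom, 0-based start, end, value
bedgraph <- data.frame(
  chrom = regions$chrom,
  start = as.integer(regions$start - 1),  # integers avoid scientific notation
  end   = as.integer(regions$end),
  value = round(regions$RPMs, 3)
)

# Write a track line followed by tab-separated rows (file name is an assumption)
out_file <- file.path(tempdir(), "example.RPMs.bedGraph")
writeLines('track type=bedGraph name="example_RPMs"', out_file)
write.table(bedgraph, out_file, append = TRUE, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
print(readLines(out_file))
```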
The generate3CseqReport function generates a summary report from the r3Cseq analysis results. The report consists of a pdf file containing all plots and text files of the interaction regions. This function may take several minutes to produce the report.
generate3CseqReport(my3Cseq.obj)
## [1] "Analyzing interaction regions for each window...."
## [1] "Analyzing interaction regions for each window...."
## [1] "File Myb_prom_FL.interaction.txt ' is created."
## [1] "File MYb_prom_FB.interaction.txt ' is created."
## [1] "File Myb_prom_FL.RPMs.bedGraph.gz ' is created."
## [1] "File MYb_prom_FB.RPMs.bedGraph.gz ' is created."
## [1] "Three files are generated : a pdf file of plots, a text file of interaction regions, and a bedGraph file."
An example of how to work with replicates is shown on the r3Cseq website, http://r3cseq.genereg.net/.
The website also provides more details of the r3Cseq analysis pipeline, and the example data sets can be downloaded from it.
Here is the output of sessionInfo on the system on which this document was compiled:
## R Under development (unstable) (2018-11-30 r75722)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] splines parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] BSgenome.Mmusculus.UCSC.mm9.masked_1.3.99
## [2] RSQLite_2.1.1
## [3] BSgenome.Mmusculus.UCSC.mm9_1.4.0
## [4] BSgenome_1.51.0
## [5] r3Cseq_1.29.0
## [6] qvalue_2.15.0
## [7] VGAM_1.0-6
## [8] rtracklayer_1.43.1
## [9] Rsamtools_1.35.0
## [10] Biostrings_2.51.1
## [11] XVector_0.23.0
## [12] GenomicRanges_1.35.1
## [13] GenomeInfoDb_1.19.1
## [14] IRanges_2.17.1
## [15] S4Vectors_0.21.6
## [16] BiocGenerics_0.29.1
## [17] BiocStyle_2.11.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 matrixStats_0.54.0
## [3] bit64_0.9-7 RColorBrewer_1.1-2
## [5] tools_3.6.0 R6_2.3.0
## [7] DBI_1.0.0 lazyeval_0.2.1
## [9] colorspace_1.3-2 tidyselect_0.2.5
## [11] bit_1.1-14 compiler_3.6.0
## [13] chron_2.3-53 Biobase_2.43.0
## [15] DelayedArray_0.9.0 bookdown_0.8
## [17] scales_1.0.0 stringr_1.3.1
## [19] digest_0.6.18 rmarkdown_1.11
## [21] pkgconfig_2.0.2 htmltools_0.3.6
## [23] rlang_0.3.0.1 bindr_0.1.1
## [25] BiocParallel_1.17.3 dplyr_0.7.8
## [27] RCurl_1.95-4.11 magrittr_1.5
## [29] GenomeInfoDbData_1.2.0 Matrix_1.2-15
## [31] Rcpp_1.0.0 munsell_0.5.0
## [33] proto_1.0.0 sqldf_0.4-11
## [35] stringi_1.2.4 yaml_2.2.0
## [37] SummarizedExperiment_1.13.0 zlibbioc_1.29.0
## [39] plyr_1.8.4 grid_3.6.0
## [41] blob_1.1.1 crayon_1.3.4
## [43] lattice_0.20-38 knitr_1.20
## [45] pillar_1.3.0 tcltk_3.6.0
## [47] reshape2_1.4.3 XML_3.98-1.16
## [49] glue_1.3.0 evaluate_0.12
## [51] data.table_1.11.8 BiocManager_1.30.4
## [53] gtable_0.2.0 purrr_0.2.5
## [55] assertthat_0.2.0 gsubfn_0.7
## [57] ggplot2_3.1.0 xfun_0.4
## [59] tibble_1.4.2 GenomicAlignments_1.19.0
## [61] memoise_1.1.0 bindrcpp_0.2.2
Soler, Eric, Charlotte Andrieu-Soler, Ernie de Boer, Jan Christian Bryne, Supat Thongjuea, Ralph Stadhouders, Robert-Jan Palstra, et al. 2010. “The genome-wide dynamics of the binding of Ldb1 complexes during erythroid differentiation.” Genes & Development 24 (3): 277–89.
Stadhouders, R, S Thongjuea, C Andrieu-Soler, RJ Palstra, JC Bryne, E De Boer, C Kockx, et al. 2012. “Dynamic long-range chromatin interactions control Myb proto-oncogene transcription during erythroid development.” The EMBO Journal 31: 986–99.
Thongjuea, S, R Stadhouders, F Grosveld, E Soler, and B Lenhard. 2013. “r3Cseq: an R package for the discovery of long-range genomic interactions from chromosome conformation capture and next-generation sequencing data.” Nucleic Acids Research 41.