Data Processing Pipelines
The ENCODE Data Coordinating Center Uniform Processing Pipelines are designed to create high-quality, consistent, and reproducible data. Pipelines are composed of discrete steps that can represent an algorithm, a software tool, or a file format manipulation. These steps are applied to the primary data (generated from an experimental assay) to produce visualizable data. The ENCODE Data Coordinating Center has developed data processing pipelines for major assay types generated by the project: RNA-seq, RAMPAGE¹, ChIP-seq, DNase-seq, ATAC-seq², and WGBS.
All data processing pipeline code is available from the ENCODE DCC GitHub, and the pipelines can be run interactively from a featured project on the DNAnexus cloud-computing platform. To learn more, select a specific pipeline.
A processing pipeline is a set of analysis steps that may be versioned as changes are made to the code and software components. Entire pipelines may also be versioned.
Step revisions are either major or minor. Minor step revisions are backwards compatible and should produce directly comparable results; these are annotated as step versions. Major step revisions result in a new pipeline version, though not all steps will change when a pipeline is versioned. Whenever a major change is made, all downstream steps must be versioned as well, because the inputs to downstream steps depend on the output of the new upstream steps. To check whether a pipeline or an analysis step has a new version, click on the blue step boxes in the pipeline graph.
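The downstream-versioning rule can be illustrated with a small graph traversal. This is a minimal sketch, not the DCC's implementation: the step names, the `PIPELINE` dictionary, and the function `steps_needing_new_version` are all hypothetical and chosen only for illustration.

```python
from collections import deque


def steps_needing_new_version(downstream, changed_step):
    """Return the changed step plus every step reachable downstream of it.

    `downstream` maps a step name to the steps that consume its output.
    """
    affected = {changed_step}
    queue = deque([changed_step])
    while queue:
        step = queue.popleft()
        for consumer in downstream.get(step, []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


# Hypothetical step graph (assumption): each key feeds the listed steps.
PIPELINE = {
    "align": ["filter"],
    "filter": ["signal", "peaks"],
    "signal": [],
    "peaks": ["replicate-concordance"],
    "replicate-concordance": [],
}

# A major revision to "filter" forces new versions of everything after it,
# while the upstream "align" step keeps its current version.
print(sorted(steps_needing_new_version(PIPELINE, "filter")))
```

A breadth-first walk is used here only because it visits each step once; any traversal of the step graph would give the same set.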
An important goal motivating the development of uniform processing pipelines is to maximize the degree to which data can be compared within and across assays. All data should be processed by directly comparable methods, and all result files of a given type (e.g. alignment BAM files) should be compatible. If older versions of results were released but new analysis steps were later adopted, an experiment may have two versions of the same file once the data are reprocessed.
RNA-seq measures RNA abundance, and RNA-seq data can be interpreted in terms of transcriptional activity and RNA stability. RNA-seq experiments contribute to our understanding of how RNA-based mechanisms impact gene regulation and thus disease and phenotypic variation. Since RNA populations are diverse, different assays are optimized to measure different RNA species, and the data from these assays are processed in specific ways.
- RNAs longer than 200 bp: for mRNAs (poly-A(+)), rRNA-depleted total RNA, or poly-A(-) RNA populations
- RNAs shorter than 200 bp: for mRNAs (poly-A(+)), rRNA-depleted total RNA, or poly-A(-) RNA populations
- miRNA-seq: for microRNAs, around 22 nt long, that are quantified from RNA-seq data
- miRNA counts: a complement to miRNA-seq
RAMPAGE (RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression) is a very accurate sequencing approach to identify transcription start sites (TSSs) at base-pair resolution, to quantify their expression, and to characterize their transcripts. RAMPAGE uses direct cDNA evidence to link specific genes and their TSSs.
Transcription factor ChIP-seq (TF ChIP-seq) specifically looks at proteins, such as sequence-specific transcription factors, which are thought to associate with specific DNA sequences to influence the rate of transcription. Histone ChIP-seq is sensitive to the histone content of chromatin, specifically to the incorporation of particular post-translational histone modifications in chromatin. The pipelines take input FASTQ files from replicated experiments and controls, as well as reference FASTA files for the initial read mapping. Both pipelines share the same mapping steps, but differ in the way the signal and peaks are called and in the subsequent statistical treatment of replicates.
- Histone ChIP-seq Pipelines
- ENCODE 3 pipelines
- ENCODE 4 pipelines
- Transcription Factor ChIP-seq Pipeline
- ENCODE 3 ChIP-seq Mapping Pipeline
DNA accessibility assays such as DNase-seq, ATAC-seq, FAIRE-seq, and MNase-seq support the goals of the ENCODE project. DNase-seq maps DNase I hypersensitive sites and is considered an accurate method of identifying regulatory elements. ATAC-seq (Assay for Transposase Accessible Chromatin with high-throughput sequencing) is viewed as an alternative to DNase-seq and MNase-seq; it probes DNA accessibility with hyperactive Tn5 transposase, which inserts sequencing adapters into accessible regions of chromatin.
- DNase-seq Pipelines
- DNase-seq pipeline
- ENCODE 3 pipelines: Single-ended, Paired-ended
- ENCODE 4 Data Standards and Documentation
- ATAC-seq Pipelines
Whole-genome bisulfite sequencing (WGBS) is used to discover methylation patterns at single-base resolution. Bisulfite treatment converts unmethylated cytosines into uracils but leaves methylated cytosines unchanged. After mapping bisulfite sequencing reads against a C-->U transformed genome, this pipeline can extract the CpG, CHG, and CHH methylation patterns genome-wide.
- WGBS Pipelines
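The two ideas in the WGBS description, an in-silico converted reference and the three cytosine contexts, can be sketched in a few lines. This is an illustrative sketch only and is not taken from the pipeline's code. The function names are hypothetical, only the forward strand is handled, and the conversion is written as C-to-T on the assumption that a DNA alphabet is used (sequencers read the converted uracil as thymine). H stands for A, C, or T.

```python
def in_silico_convert(seq):
    """Replace every C with T, mimicking a fully converted forward strand."""
    return seq.upper().replace("C", "T")


def cytosine_context(seq, i):
    """Classify the forward-strand cytosine at index i as CpG, CHG, or CHH.

    Returns None if the base is not a cytosine or if too few bases follow
    it to decide the context.
    """
    seq = seq.upper()
    if seq[i] != "C":
        return None
    following = seq[i + 1:i + 3]
    if following[:1] == "G":
        return "CpG"
    if len(following) < 2 or following[0] not in "ACT":
        return None
    if following[1] == "G":
        return "CHG"
    if following[1] in "ACT":
        return "CHH"
    return None


reference = "ACGTCAGCTTC"  # toy sequence, not real data
print(in_silico_convert(reference))
for index, base in enumerate(reference):
    if base == "C":
        print(index, cytosine_context(reference, index))
```

A real methylation caller would also handle the reverse strand (G-to-A conversion) and count converted versus unconverted reads at each cytosine; both are omitted here for brevity.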
3D chromatin assays such as Hi-C and ChIA-PET support the ENCODE Project's goal of mapping the organization of the human genome and examining the spatial proximity of chromosomal loci. Hi-C captures the three-dimensional architecture of whole genomes by coupling proximity-based ligation with massively parallel sequencing. ChIA-PET provides comprehensive maps of long-range chromatin interactions between structural and functional elements in genomes.
1 Batut, P., & Gingeras, T. R. (2013). RAMPAGE: Promoter Activity Profiling by Paired-end Sequencing of 5′-complete cDNAs. Current Protocols in Molecular Biology, 104, Unit 25B.11. http://doi.org/10.1002/0471142727.mb25b11s104
2 Buenrostro, J., Wu, B., Chang, H., & Greenleaf, W. (2015). ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current Protocols in Molecular Biology, 109, 21.29.1–21.29.9. http://doi.org/10.1002/0471142727.mb2129s109