The ENCODE-defined processing pipelines are designed to create high-quality, consistent, and reproducible data. Pipelines are composed of discrete steps, each of which can represent an algorithm, a software tool, or a file format manipulation. These steps are applied to the primary data (generated from an experimental assay) to produce visualizable data. The ENCODE Data Coordinating Center has developed data processing pipelines for the major assay types generated by the project: RNA-seq (RNA-sequencing) for long and short RNA insert sizes, RAMPAGE for transcription start sites, ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) against transcription factors and histone modifications, and WGBS (whole-genome bisulfite sequencing) for DNA methylation analysis.
All data processing pipeline code is available from the ENCODE DCC GitHub, and the pipelines can be run interactively from a featured project on the DNAnexus cloud-computing platform. To learn more, select a specific pipeline. Links to archived pipelines are listed on each pipeline page.
A processing pipeline is a set of analysis steps that may be versioned as changes are made to the code and software components. Entire pipelines may also be versioned.
There are major and minor step revisions. Minor step revisions are backwards compatible and should produce directly comparable results; these are annotated as versions of the step. Major step revisions result in a new pipeline version, though not all steps will change when a pipeline is versioned. Whenever a major change is made, all downstream steps must be versioned as well, because the inputs to downstream steps depend on the output of the new upstream steps. To see whether an analysis step or the entire pipeline has a new version, click on the blue step boxes found in the pipeline graph. These step boxes are also shown on the experiment pages.
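The downstream rule can be illustrated with a minimal sketch, assuming the pipeline is represented as a directed graph from each step to the steps that consume its output. The step names and the function `steps_to_reversion` are hypothetical and are not taken from the ENCODE code base.

```python
from collections import deque


def steps_to_reversion(downstream, changed_step):
    """Return the changed step plus every step reachable downstream of it.

    `downstream` maps a step name to the list of steps that take its
    output as input (a hypothetical representation of a pipeline graph).
    """
    affected = {changed_step}
    queue = deque([changed_step])
    while queue:
        step = queue.popleft()
        for child in downstream.get(step, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical pipeline graph, for illustration only.
example_graph = {
    "mapping": ["signal", "peak_calling"],
    "signal": [],
    "peak_calling": ["replicate_comparison"],
    "replicate_comparison": [],
}

# A major change to peak calling also forces its consumers to be versioned,
# while the upstream mapping step and the sibling signal step are untouched.
print(sorted(steps_to_reversion(example_graph, "peak_calling")))
# ['peak_calling', 'replicate_comparison']
```

A breadth-first walk is used here only because it is simple; any graph traversal that visits every descendant of the changed step would give the same set.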
An important goal motivating the development of uniform processing pipelines is to maximize the degree to which data can be compared within and across assays. All data should be processed by directly comparable methods, and all result files of a given type (e.g. alignment bams) should be compatible. If older versions of results were released but new analysis steps were later adopted, an experiment may have two versions of the same file once the data is reprocessed.
RNA-seq measures RNA abundance, and RNA-seq data can be interpreted in terms of transcriptional activity and RNA stability. RNA-seq experiments contribute to our understanding of how RNA-based mechanisms impact gene regulation and thus disease and phenotypic variation. Since RNA populations are diverse, different assays are optimized to measure different RNA species, and the data from these assays are processed in specific ways. The ENCODE Consortium has developed the following pipelines:
- RNAs longer than 200 bp: for mRNAs (poly-A(+)), rRNA-depleted total RNA, or poly-A(-) RNA populations
- RNAs shorter than 200 bp: for mRNAs (poly-A(+)), rRNA-depleted total RNA, or poly-A(-) RNA populations
- miRNA-seq: for microRNAs, around 22 nt long, that are quantified from RNA-seq data
- miRNA counts: a complement to miRNA-seq
RAMPAGE (RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression) is a very accurate sequencing approach to identify transcription start sites (TSSs) at base-pair resolution, to quantify their expression, and to characterize their transcripts. RAMPAGE uses direct cDNA evidence to link specific genes and their TSSs.
ChIP-seq combines chromatin immunoprecipitation with DNA sequencing to identify the binding sites of chromatin-associated proteins. The ENCODE Consortium has developed two distinct pipelines to study two different classes of protein-DNA interactions. Transcription factor ChIP-seq (TF ChIP-seq) specifically looks at proteins, such as sequence-specific transcription factors, which are thought to associate with specific DNA sequences to influence the rate of transcription. In these assays, the IP target is expected to bind in a rather punctate pattern, and the distribution of sequencing reads is expected to be sharply biased in favor of genomic locations where the target is bound in chromatin, as compared to unenriched genomic DNA or a mock IP control. Histone ChIP-seq is sensitive to the histone content of chromatin, specifically to the incorporation of particular post-translational histone modifications in chromatin. Taken together, the distribution of several histone modifications can be interpreted in terms of distinct chromatin states that represent distinct modes of gene regulation.
The pipelines take input FASTQ files from replicated experiments and controls, as well as reference FASTA files for the initial read mapping. Both pipelines share the same mapping steps, but differ in the way the signal and peaks are called and in the subsequent statistical treatment of replicates.
DNA accessibility assays such as DNase-seq, ATAC-seq, FAIRE-seq, and MNase-seq are common assays that support the goals of the ENCODE project. DNase-seq maps DNase I hypersensitive sites and is considered to be an accurate method of identifying regulatory elements. ATAC-seq (Assay for Transposase Accessible Chromatin with high-throughput sequencing) is viewed as an alternative to DNase-seq and MNase-seq; it probes DNA accessibility with hyperactive Tn5 transposase, which inserts sequencing adapters into accessible regions of chromatin. The consortium has developed two kinds of pipelines: DNase-seq and ATAC-seq.
- DNase-seq Pipelines
  - Single-end pipelines
  - Paired-end pipelines
- ATAC-seq Pipeline
The ENCODE WGBS pipeline is used to discover methylation patterns at single-base resolution. Bisulfite treatment converts unmethylated cytosines into uracils, which are read as thymines after amplification and sequencing, but leaves methylated cytosines unchanged. After mapping bisulfite sequencing reads against a C-->T transformed genome, this pipeline can extract the CpG, CHG and CHH methylation patterns genome-wide.
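The sequence contexts named above can be made concrete with a small sketch. It assumes the usual convention that H stands for A, C, or T, considers the forward strand only, and models the in silico reference conversion as replacing every C with T. The function names are hypothetical and this is not the ENCODE pipeline's own implementation.

```python
# H is the IUPAC code for "not G", i.e. A, C, or T.
H_BASES = set("ACT")


def in_silico_c_to_t(sequence):
    """Return a fully converted copy of a forward-strand sequence (C -> T)."""
    return sequence.upper().replace("C", "T")


def cytosine_context(sequence, position):
    """Classify the cytosine at `position` as 'CpG', 'CHG', or 'CHH'.

    Returns None when the context cannot be determined, for example near
    the end of the sequence or when an ambiguous base such as N follows.
    """
    seq = sequence.upper()
    if seq[position] != "C":
        raise ValueError("position does not hold a cytosine")
    following = seq[position + 1 : position + 3]
    if following[:1] == "G":
        return "CpG"
    if len(following) == 2 and following[0] in H_BASES:
        if following[1] == "G":
            return "CHG"
        if following[1] in H_BASES:
            return "CHH"
    return None


print(in_silico_c_to_t("ACGTC"))       # ATGTT
print(cytosine_context("ACGT", 1))     # CpG
print(cytosine_context("ACAGT", 1))    # CHG
print(cytosine_context("ACATT", 1))    # CHH
```

A real caller would also handle the reverse strand, typically by applying the same logic to the reverse complement.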