Data Processing Pipelines

Overview

The ENCODE Data Coordinating Center Uniform Processing Pipelines are designed to create high-quality, consistent, and reproducible data. Pipelines are composed of discrete steps that can represent an algorithm, a software tool, or a file format manipulation. These steps are applied to the primary data (generated from an experimental assay) to produce visualizable data. The ENCODE Data Coordinating Center has developed data processing pipelines for major assay types generated by the project: RNA-seq, RAMPAGE¹, ChIP-seq, DNase-seq, ATAC-seq², and WGBS.

All data processing pipeline code is available from the ENCODE DCC GitHub. To learn more, select a specific pipeline.

RNA-seq pipelines | RAMPAGE pipeline | Chromatin Immunoprecipitation pipelines | DNA accessibility pipelines | DNA methylation pipeline | 3D-chromatin pipelines

Several of these pipelines (ChIP-seq, ATAC-seq, RNA-seq, long-read RNA-seq, microRNA-seq, WGBS, and Hi-C) and their WDL workflows have been deposited in Dockstore. Dockstore provides an interface for executing the ported pipelines on various platforms (such as DNAnexus, Terra, and AnVIL). Five of the pipelines (ChIP-seq, ATAC-seq, RNA-seq, long-read RNA-seq, and microRNA-seq) have been ported to the Truwl bioinformatics platform, and two (ChIP-seq and ATAC-seq) are available on the Seven Bridges platform.

Pipeline versioning

A processing pipeline is a set of analysis steps that may be versioned as changes are made to the code and software components. Entire pipelines may also be versioned.

There are major and minor step revisions. Minor step revisions are backwards compatible and should produce directly comparable results; these are annotated as step versions. Major step revisions result in a new pipeline version, though not all steps will change when a pipeline is versioned. Whenever a major change is made, all downstream steps must be versioned as well, because their inputs depend on the output of the new upstream steps. To see whether a pipeline or an analysis step has a new version, click on the blue step boxes in the pipeline graph.
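The downstream rule can be illustrated with a small sketch. It assumes a pipeline is described as a mapping from each step to the steps that consume its output. The step names (`trim`, `align`, `signal`, `peaks`) and the function `steps_to_reversion` are hypothetical and are not taken from the ENCODE codebase.

```python
from collections import deque


def steps_to_reversion(downstream, changed_step):
    """Return the changed step plus every step reachable downstream of it.

    `downstream` maps a step name to the list of steps that take its
    output as input. A major revision of `changed_step` means all of
    the returned steps need a new version.
    """
    affected = {changed_step}
    queue = deque([changed_step])
    while queue:
        step = queue.popleft()
        for child in downstream.get(step, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical four-step pipeline: trim -> align -> (signal, peaks)
pipeline = {
    "trim": ["align"],
    "align": ["signal", "peaks"],
    "signal": [],
    "peaks": [],
}

# A major change to "align" also affects "signal" and "peaks", but not "trim".
print(sorted(steps_to_reversion(pipeline, "align")))
```

Under this model a major revision of an early step touches most of the pipeline, while a major revision of a terminal step touches only that step.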

An important goal motivating the development of uniform processing pipelines is to maximize the degree to which data can be compared within and across assays. All data should be processed by directly comparable methods, and all result files of a given type (e.g. alignment BAMs) should be compatible. If older versions of results were released but new analysis steps were later adopted, an experiment may have two versions of the same file once the data are reprocessed.

RNA-seq pipelines

RNA-seq measures RNA abundance, and RNA-seq data can be interpreted in terms of transcriptional activity and RNA stability. RNA-seq experiments contribute to our understanding of how RNA-based mechanisms impact gene regulation and thus disease and phenotypic variation. Since RNA populations are diverse, different assays are optimized to measure different RNA species, and the data from these assays are processed in specific ways.

RNAs longer than 200 bp: for mRNAs (poly-A(+)), rRNA-depleted total RNA, or poly-A(-) RNA populations
- ENCODE 3 pipelines
RNAs shorter than 200 bp: for mRNAs (poly-A(+)), rRNA-depleted total RNA, or poly-A(-) RNA populations
- ENCODE 3 pipeline
  - Single-ended, stranded only
  - Data Standards and Documentation
bulk RNA-seq: for total RNA-seq, poly(A)+ RNA-seq, poly(A)- RNA-seq, CRISPR RNA-seq, CRISPRi RNA-seq, shRNA knockdown RNA-seq, and siRNA knockdown RNA-seq.
- ENCODE 3 pipeline
- ENCODE 4 pipeline
  - Single-ended or paired-ended, stranded or unstranded
  - Data Standards and Documentation
miRNA-seq: for microRNAs, around 22nt long, that are quantified from RNA-seq data
- ENCODE 3 pipeline
  - Single-ended
  - Data Standards and Documentation
- ENCODE 4 pipeline
  - Single-ended
  - Data Standards and Documentation

miRNA counts: a complement to miRNA-seq
- ENCODE 3 pipeline
  - pipeline
  - Data Standards and Documentation

RAMPAGE pipeline

RAMPAGE (RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression) is a very accurate sequencing approach to identify transcription start sites (TSSs) at base-pair resolution, to quantify their expression, and to characterize their transcripts. RAMPAGE uses direct cDNA evidence to link specific genes and their TSSs.

RAMPAGE Pipeline
- ENCODE 3 pipeline
  - Paired-ended
  - Data Standards and Documentation

Chromatin Immunoprecipitation pipelines

Transcription factor ChIP-seq (TF ChIP-seq) specifically looks at proteins, such as sequence-specific transcription factors, which are thought to associate with specific DNA sequences to influence the rate of transcription. Histone ChIP-seq is sensitive to the histone content of chromatin, specifically to the incorporation of particular post-translational histone modifications in chromatin. The pipelines take input FASTQ files from replicated experiments and controls, as well as reference FASTA files for the initial read mapping. Both pipelines share the same mapping steps, but differ in the way the signal and peaks are called and in the subsequent statistical treatment of replicates.

Histone ChIP-seq Pipelines
- ENCODE 3 pipelines
- ENCODE 4 pipelines
Transcription Factor ChIP-seq Pipeline
- ENCODE 3 pipelines
- ENCODE 4 pipelines
ENCODE 3 ChIP-seq Mapping Pipeline

DNA accessibility pipelines

DNA accessibility assays such as DNase-seq, ATAC-seq, FAIRE-seq, and MNase-seq are common assays that support the goals of the ENCODE project. DNase-seq maps DNase I hypersensitive sites and is considered to be an accurate method of identifying regulatory elements. ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing) is viewed as an alternative to DNase-seq and MNase-seq; it probes DNA accessibility with hyperactive Tn5 transposase, which inserts sequencing adapters into accessible regions of chromatin.

DNase-seq Pipelines
- ENCODE 3 pipelines:
  - Single-ended
  - Paired-ended
- ENCODE 4 pipelines:
  - Paired-ended or single-ended
  - Data Standards and Documentation
ATAC-seq Pipelines
- ENCODE 4 pipelines

DNA methylation pipeline

Whole-genome bisulfite sequencing (WGBS) is used to discover methylation patterns at single-base resolution. Bisulfite treatment is used to convert unmethylated cytosines into uracils, but leaves methylated cytosines unchanged. After mapping bisulfite sequencing reads against a C-->U transformed genome, this pipeline can extract the CpG, CHG, and CHH methylation patterns genome-wide.
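As a toy illustration of the two ideas above, the sketch below simulates bisulfite conversion on a short sequence and classifies a cytosine by its sequence context. It is not the pipeline's implementation. The function names are hypothetical, only the forward strand is handled, and converted bases are written as T on the assumption that uracil is read as thymine after amplification and sequencing. H stands for A, C, or T.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment of a forward-strand sequence.

    Unmethylated C is written as T (uracil is read as thymine after
    amplification); C at a methylated position is left unchanged.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq.upper())
    )


def cytosine_context(seq, i):
    """Classify the cytosine at index i as 'CpG', 'CHG', or 'CHH'.

    Returns None when the context cannot be determined (too close to
    the end of the sequence, or an ambiguous N in the window).
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position %d is not a cytosine" % i)
    window = seq[i:i + 3]
    if len(window) >= 2 and window[1] == "G":
        return "CpG"
    if len(window) < 3 or "N" in window:
        return None
    return "CHG" if window[2] == "G" else "CHH"


# The C at index 1 is methylated and survives; the C at index 3 is converted.
print(bisulfite_convert("ACGCT", {1}))
print(cytosine_context("ACGCT", 1))
```

Comparing converted reads with the reference at each cytosine, grouped by context, is the basic idea behind per-context methylation calls.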

WGBS Pipelines
- ENCODE 3 pipelines
  - Single-ended
  - Paired-ended
- ENCODE 4 pipelines
  - Single-ended or paired-ended
  - Data Standards and Documentation

3D-chromatin pipelines

3D-chromatin assays such as Hi-C and ChIA-PET support the ENCODE Project's goals of mapping the organization of the human genome and examining the spatial proximity of chromosomal loci. The Hi-C assay captures the three-dimensional architecture of whole genomes by coupling proximity-based ligation with massively parallel sequencing. ChIA-PET provides comprehensive maps of long-range chromatin interactions between structural and functional elements in genomes.

Hi-C pipeline
- ENCODE 4 pipeline
  - Paired-ended or single-ended, stranded or unstranded
  - Data Standards and Documentation
ChIA-PET pipeline
- ENCODE 4 pipeline
  - Paired-ended
  - Data Standards and Documentation

References

1 Batut, P., & Gingeras, T. R. (2013). RAMPAGE: Promoter Activity Profiling by Paired-end Sequencing of 5′-complete cDNAs. Current Protocols in Molecular Biology, 104, Unit 25B.11. http://doi.org/10.1002/0471142727.mb25b11s104

2 Buenrostro, J., Wu, B., Chang, H., & Greenleaf, W. (2015). ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current Protocols in Molecular Biology, 109, 21.29.1–21.29.9. http://doi.org/10.1002/0471142727.mb2129s109