Data standards

Overview

The ENCODE Consortium analyzes the quality of its data using a variety of metrics. Quality metrics for evaluating epigenomic assays are an active area of research; standards are emerging as more metrics are applied to more datasets and experiment types. The typical values for a quality metric can vary among different assays, or even among different features within the same assay, such as the antibodies used in ChIP-seq experiments. Currently, no single measurement identifies all high-quality or low-quality samples. As with quality control for other types of experiments, multiple assessments (including manual inspection of tracks) are useful because different assessments may capture different concerns. Comparisons within an experimental method (e.g., comparing replicates to each other, comparing values for one antibody across several cell types, or comparing the same antibody and cell type between different labs) can help identify possible stochastic errors.
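
Such comparisons are straightforward to quantify. The sketch below is a minimal illustration, not ENCODE pipeline code: it computes pairwise Pearson correlations across binned replicate signals, so that a replicate with low correlation to its partners stands out for manual inspection. All replicate names and coverage values are hypothetical placeholders.

    import itertools

    import numpy as np

    def pairwise_replicate_correlation(replicates):
        """Pearson correlation for every pair of replicate signal vectors."""
        correlations = {}
        for (name_a, sig_a), (name_b, sig_b) in itertools.combinations(replicates.items(), 2):
            # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
            correlations[(name_a, name_b)] = float(np.corrcoef(sig_a, sig_b)[0, 1])
        return correlations

    # Hypothetical binned coverage values for three replicates of one assay.
    replicates = {
        "rep1": np.array([12.0, 30.5, 8.2, 44.1, 5.0]),
        "rep2": np.array([11.5, 29.0, 9.1, 46.3, 4.2]),
        "rep3": np.array([3.0, 55.0, 2.5, 12.0, 40.0]),  # an outlier replicate
    }

    for pair, r in pairwise_replicate_correlation(replicates).items():
        print(pair, round(r, 3))  # a low r singles out a replicate for inspection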

Experimental guidelines

The ENCODE Consortium has adopted shared experimental guidelines for the most common ENCODE assays. The guidelines have evolved over time as technologies have changed, and current guidelines are informed by results gathered during the project. The ENCODE Consortium has also developed a set of antibody characterization standards to address the problems of specificity and reproducibility that are characteristic of antibody-based assays. Previous versions of all guidelines are archived and available for reference. 

Quality metrics

As part of the third phase of ENCODE, uniform analysis pipelines were developed for the major assay types, each of which produces a set of data quality metrics. Many of the software tools used to calculate quality metrics can be found, with their citations, on the Software Tools page, while the Terms and Definitions page contains information on individual metrics. The ENCODE Consortium uses these measures to set standards detailing the criteria for excellent, passable, and poor data. On the ENCODE portal, data that do not meet the minimum cutoff values are flagged according to the severity of the error; examples of errors include low read depth, poor replicate concordance, and low correlation. Metadata inconsistencies may also be flagged, such as missing biosample donor information or conflicts between the experiment information and the biosample information. Metadata are annotated using ontologies, which enhance interoperability and consistency between databases; terms that do not match the ontologies ENCODE uses will be marked with an audit. Please visit the Audits page for a full list of flags and explanations of each.
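
As a minimal sketch of how such flagging can work (the cutoff values below are hypothetical, and this is not the portal's audit code), a metric value can be mapped to a severity level by comparing it against thresholds separating excellent, passable, and poor data:

    from dataclasses import dataclass

    @dataclass
    class MetricStandard:
        name: str
        excellent: float  # at or above this value: meets the standard, no flag
        passable: float   # at or above this value: flagged as a warning only

    def audit_metric(standard, value):
        """Return a severity flag for a metric value, or None if it meets the standard."""
        if value >= standard.excellent:
            return None
        if value >= standard.passable:
            return f"WARNING: {standard.name} = {value} is below the recommended value"
        return f"ERROR: {standard.name} = {value} is below the minimum cutoff"

    # Hypothetical read-depth standard (values in millions of usable reads).
    read_depth = MetricStandard(name="usable read depth (M)", excellent=20.0, passable=10.0)
    print(audit_metric(read_depth, 25.0))  # None: meets the standard
    print(audit_metric(read_depth, 12.0))  # WARNING: below the recommended value
    print(audit_metric(read_depth, 4.0))   # ERROR: below the minimum cutoff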

Older standards for datasets published as part of the 2012 ENCODE integrative analysis publications can be found on the quality metrics page associated with those publications.

Standards

The links below contain information on experimental guidelines, requirements for processing on the uniform pipelines, and the application of quality metrics for each assay type. Please note that documentation is currently being updated and consolidated, and you may see changes over the next several weeks (Updated May 2017).

ChIP-seq

Long RNA-seq

RAMPAGE

microRNA-seq

ATAC-seq

RNA Bind-N-Seq

WGBS

Small RNA-seq

eCLIP

microRNA counts

DNase-seq

Information on the reference sequences used by the Uniform Processing Pipelines is available on the portal.
