Integrative Analysis of ENCODE Data

Introduction

The major goal of the ENCODE project is to identify all functional elements in the human genome sequence, where a functional element is defined as a discrete region of the genome that encodes a reproducible biochemical signature. ENCODE data production groups generate data and submit it to the ENCODE Data Coordination Center (DCC) for quality control and release. A cross-consortium effort to analyze all of the data types together, and to produce integrative interpretations that are useful to the community, has been completed. The results of these analyses have been published as the ENCODE integrative analysis publication package. This page describes a series of resources associated with the integrative analysis of ENCODE data.

Analysis tools

ENCODE analysis virtual machine

The supplementary information for the ENCODE integrative analysis Nature publication includes a set of code bundles that provide the scripts and processing steps corresponding to the methodology used in the analyses associated with the paper. The analysis group has established an ENCODE virtual machine instance of the software, using the code bundles, where each analysis program has been tested and run. The virtual machines are freely available for interested parties to use to work with the data and tools used in the integrative analysis.

Software tools

A page describing the software tools used in the ENCODE project is provided on the ENCODE portal.

Data standards and quality metrics

As part of the integrative analysis, the ENCODE project has established a number of standards. Details of each set of standards are available on the following pages:

A detailed description of the ChIP-seq standards is provided in the publication: Landt et al., ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012 Sep;22(9):1813-31.
Uniform guidelines for the most common ENCODE experiments are described in the experiment guidelines.
ENCODE datasets are collected using a variety of technologies, such as ChIP-seq and RNA-seq. The consortium has undertaken several efforts to characterize these platforms to better understand the data collected with them. These efforts are summarized on the platform characterization page.
Quality metrics developed to enable comparison and standardized processing of the data included in these publications are described on the quality metrics page.

Data

Data Coordination Center resources

All ENCODE data used for these publications, like all production data generated by the ENCODE consortium, is submitted to the DCC. Data is reviewed for quality and released to the scientific community. The DCC maintains the ENCODE portal providing access to this data.

Analysis data hub

The integrative analysis process has been a distributed effort by many groups. Individual analysts downloaded and processed files from the ENCODE download site, and created intermediate and final analysis products in various forms. Now that the analysis has been completed, the analysis data is being made available for viewing and downloading through a UCSC public data hub. This data hub includes descriptions of ENCODE data in uniformly processed signal and element representations, as well as genome segmentations. The ENCODE downloads page includes an Analysis Hub section that provides access to files on the hub. The ENCODE Integrative Analysis Data Hub can also be visualized in the UCSC Genome Browser.

Analysis FTP site

Access to the analysis products is also provided via anonymous FTP from the EBI ENCODE analysis FTP server. This site contains an organized file structure, with the ENCODE analysis datasets located in subdirectories within the byDataType directory.
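
For readers who prefer to script their access, the following is a minimal sketch of how the byDataType directory might be listed over anonymous FTP using Python's standard ftplib module. The helper names list_analysis_datasets and connect_anonymous are illustrative and not part of any ENCODE tool. The server host name is not given on this page, so it must be supplied by the user, and the sketch assumes that byDataType can be reached from the directory in which the anonymous login lands.

```python
from ftplib import FTP


def connect_anonymous(host):
    """Open an anonymous FTP session to the given host.

    The host name is an input here; this page does not state it.
    """
    ftp = FTP(host)  # connects immediately when a host is passed
    ftp.login()      # no credentials means an anonymous login
    return ftp


def list_analysis_datasets(ftp, base_dir="byDataType"):
    """Return the sorted entry names found directly under base_dir.

    Assumption: base_dir is reachable from the current directory of
    the session. Adjust it if the server places it deeper in the tree.
    """
    ftp.cwd(base_dir)
    # Some servers return bare names and others return paths,
    # so keep only the final path component of each entry.
    return sorted(name.rsplit("/", 1)[-1] for name in ftp.nlst())


# Example usage (requires network access and the real host name):
#   ftp = connect_anonymous(host)  # host: the EBI ENCODE analysis FTP server
#   for name in list_analysis_datasets(ftp):
#       print(name)
#   ftp.quit()
```

Keeping the listing logic separate from the connection step means it can be exercised against any object that provides cwd and nlst methods, which is convenient when the server is not reachable.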