ENCODE Encyclopedia, Version 2 (Archived): Overview

Introduction

The ENCODE Project provides a set of candidate genomic regions that can serve as predictions for further investigation. This page provides links to visualize, search, and download a set of genomic annotations as well as a list of publications that contain additional data.

Annotated genomic regions

Annotations for human ENCODE data are as follows. An ENCODE query tool can search either human or mouse data. Additional annotations for mouse ENCODE data will be presented in a future release. 

  • Candidate enhancers and promoters for DNase hypersensitivity, annotated with histone marks H3K27ac and H3K4me1 which are enriched at enhancers, H3K4me3 which is enriched at promoters, H3K9ac which is enriched at both enhancers and promoters, as well as ChIP peaks of transcription factors. Out of 177 cell types with DNase-seq data, we annotated 45 cell types with H3K27ac, 48 cell types with H3K4me1, 94 cell types with H3K4me3, and 27 cell types with H3K9ac in a cell type specific manner.   [Download methods ]
     
    Click to visualize tracks at UCSC Genome Browser or the WashU browser

Data from the Common fund- supported Roadmap Epigenomics Mapping Consortium (REMC) were included in this analysis. Please see the 2015 paper on their analysis of reference human genomes for more information. 

    • Distal DNase peaks [Download]
    • Proximal DNase peaks [Download]
    • Distal H3K27ac annotations (cell type specific) [Download]
    • Distal H3K4me1 annotations (cell type specific) [Download]
    • Distal H3K9ac annotations (cell type specific) [Download]
    • Proximal H3K4me3 annotations (cell type specific) [Download]
    • Proximal H3K9ac annotations (cell type specific) [Download]
    • Distal TF binding sites [Download]
    • Proximal TF binding sites [Download]
  • Gene expression over ~60 human cell types with genes annotated by GENCODE 19. The ENCODE query tool can search human or mouse gene expression. [ENCODE query tool | Visualize data | Download data | Download methods]
  • Transcription start site (TSS) lists [View README]
    • GENCODE v19 TSS [Download]
    • GENCODE v19 TSS stratified by strict Fantom5 CAGE clusters [Download]
    • GENCODE v19 TSS stratified by robust Fantom5 CAGE clusters [Download]
    • GENCODE v19 TSS stratified by permissive Fantom5 CAGE clusters [Download]

Additional annotations

Papers previously published by the ENCODE Consortium contain data files that include additional genomic annotations. Search for all publications with ENCODE element data.

Peaks

Peaks are enriched regions of the genome corresponding to either sites of transcription factor binding or DNase hypersensitivity identified during various functional genomic assays. In this section, we provide a list of peaks in various cell lines using both DNase-Seq and ChIP-Seq assays. View publications.

RNAs

RNA represents the direct readout of the genetic information encoded by genomes and a significant proportion of a cell’s regulatory capabilities are focused on its synthesis, processing, transport, modification and translation. A catalogue of the RNA species made inside the cell and the amount of RNA from each of these loci across various cell lines is provided in this section. View publications.

Promoters

The promoter is the region proximal to the transcription start site of a gene that regulates its transcription using transcription factor binding sites. These transcription factors recruit RNA polymerase after binding to the promoter and initiate transcription of the gene. View publications.

Enhancers (predicted from supervised machine learning methods)

An enhancer is a regulatory DNA sequence where transcription factors bind in order to regulate the transcription of an associated gene. Enhancers are typically distant from the transcription start site of a gene and can either be upstream or downstream of the start of a gene. The activity of enhancers behave in a tissue and developmental time point specific manner. This section contains enhancers predicted using various supervised machine learning methods. View publications.

Semi-automated genome annotations

Semi-automated genome annotation (SAGA) methods take as input a heterogeneous collection of genomic data derived from a particular cell type and use machine learning methods to simultaneously partition the genome and assign labels to the resulting segments. The labels are assigned so that two genomic segments with the same label exhibit similar patterns in the associated data. The procedure is "semi-automated" because regions with a given label are then manually compared with known annotations in order to assign more intuitive labels such as enhancer, active promoter, gene body, etc. The result is an annotation of the entire genome in each cell line, taking into account diverse properties such as chromatin accessibility, patterns of histone modifications, transcription factor binding, etc. A SAGA method thus summarizes a very large collection of data into a more meaningful form. View publications.

HOT regions

High Occupancy of Target (HOT) regions are compact genome regions in which a large number of different transcription-related factors bind. HOT regions have higher density of binding than would be expected by chance using a 5% significance threshold. They have been shown to play an important role in gene regulation. View publications.

Connectivity

Many TFs bind to promoter or enhancer regions of a gene, thereby activating or repressing the expression of that target gene. The targets of a TF are predicted using different methods in a cell-line specific fashion. This connectivity pattern can then be studied in the form of gene regulatory networks. View publications.

Motifs

The DNA binding sites of transcription related factors (or their complexes) typically have short consensus sequences that are typically 4 to 30 base pairs long. These sites called motifs and can be identified using TF binding sites identified by ChIP-Seq. View publications.

Other

Miscellaneous elements from publications on the ENCODE consortium. This includes allele-specific SNPs in the genome identified from RNA-Seq and ChIP-Seq. View publications