Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors.

Yip KY, Cheng C, Bhardwaj N, Brown JB, Leng J, Kundaje A, Rozowsky J, Birney E, Bickel P, Snyder M, Gerstein M.
Genome biology. 2012;13(9):R48.
Abstract
BACKGROUND: Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors.
RESULTS: As part of the consortium effort in providing a concise abstraction of the data for facilitating various types of downstream analyses, we constructed statistical models that capture the genomic features of three paired types of regions by machine-learning methods: firstly, regions with active or inactive binding; secondly, those with extremely high or low degrees of co-binding, termed HOT and LOT regions; and finally, regulatory modules proximal or distal to genes. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, many of which were validated experimentally. We further associated the predicted enhancers with potential target transcripts and the transcription factors involved. For HOT regions, we found a significant fraction of transcription factor binding without clear sequence motifs and showed that this observation could be related to strong DNA accessibility of these regions.
CONCLUSIONS: Overall, the three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity. Our machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.

Related data

Available data
enhancer annotations
File format
BED
Data summary
Enhancers are identified as binding active regions (BARs) that lie outside promoter-proximal regulatory modules (PRMs) and more than 10 kb away from any GENCODE gene or non-coding gene. A random forest model trained on chromatin accessibility and histone modification data was used to identify BARs, with randomly sampled TF binding regions and non-binding regions as the positive and negative training sets, respectively. PRMs are identified with a model similar to that for BARs, using TF binding regions at transcription start sites (TSSs) as the positive training set and non-binding regions, regions far from any TSS, or both as negative training sets. The model has been applied to 5 ENCODE cell lines: GM12878, K562, H1-hESC, HeLa-S3, and HepG2.
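The filtering step described above (keep BARs that do not overlap a PRM and are more than 10 kb from any gene) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the in-memory tuple representation, and the example coordinates are all hypothetical, and intervals are assumed to be BED-style half-open `(chrom, start, end)`.

```python
from typing import List, Tuple

# Assumption: BED-style half-open interval (chrom, start, end).
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def gap(a: Interval, b: Interval) -> int:
    """Distance in bp between two same-chromosome intervals (0 if they overlap or touch)."""
    return max(0, a[1] - b[2], b[1] - a[2])


def candidate_enhancers(
    bars: List[Interval],
    prms: List[Interval],
    genes: List[Interval],
    min_gene_distance: int = 10_000,
) -> List[Interval]:
    """Keep BARs outside every PRM and more than `min_gene_distance` bp from every gene."""
    kept = []
    for bar in bars:
        if any(overlaps(bar, prm) for prm in prms):
            continue  # inside a promoter-proximal module
        if any(g[0] == bar[0] and gap(bar, g) <= min_gene_distance for g in genes):
            continue  # too close to an annotated gene
        kept.append(bar)
    return kept


# Hypothetical toy coordinates, for illustration only.
bars = [("chr1", 1000, 1200), ("chr1", 50000, 50200),
        ("chr1", 100000, 100200), ("chr2", 500, 700)]
prms = [("chr1", 900, 1100)]
genes = [("chr1", 55000, 60000)]
enhancers = candidate_enhancers(bars, prms, genes)
```

The quadratic scan is fine for a toy example; genome-scale data would normally call for sorted intervals or an interval-tree library.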
Available data
enhancer annotations
File format
BED
Data summary
The enhancer list was updated in 2014 using the same methodology.
Available data
HOT regions
File format
BED
Data summary
ChIP-seq signals from the transcription factor experiments were grouped into 100 bp bins. The co-binding of different factors was computed from these signals. HOT and LOT regions were defined as the top 1% of region-specific co-occurrence and the bottom 1% of non-zero degrees of region-specific co-occurrence, respectively.
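The binning and thresholding idea can be sketched as follows. This is a deliberately simplified illustration, not the published procedure: it assumes the degree of co-occurrence in a bin is just the number of distinct factors with a peak touching that bin (the original measure is computed from ChIP-seq signals), it ranks only bins with non-zero degree for both HOT and LOT, and all names and example peaks are hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 100  # bp, as in the data summary


def cooccurrence_by_bin(peaks):
    """Map (chrom, bin_index) -> number of distinct factors with a peak in that bin.

    `peaks` is an iterable of (factor, chrom, start, end), half-open coordinates.
    Assumption: peak presence stands in for the signal-based co-occurrence measure.
    """
    factors = defaultdict(set)
    for factor, chrom, start, end in peaks:
        for b in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1):
            factors[(chrom, b)].add(factor)
    return {key: len(found) for key, found in factors.items()}


def hot_and_lot(degrees, fraction=0.01):
    """Return (hot_bins, lot_bins): top and bottom `fraction` of non-zero bins."""
    ranked = sorted((d, key) for key, d in degrees.items() if d > 0)
    n = max(1, int(len(ranked) * fraction))  # always report at least one bin
    lot = [key for _, key in ranked[:n]]
    hot = [key for _, key in ranked[-n:]]
    return hot, lot


# Hypothetical toy peaks, for illustration only.
peaks = [
    ("A", "chr1", 0, 100),
    ("B", "chr1", 50, 150),
    ("C", "chr1", 0, 100),
    ("A", "chr1", 300, 400),
]
degrees = cooccurrence_by_bin(peaks)
hot, lot = hot_and_lot(degrees)
```

Ties at the threshold are broken by bin position here, which is an arbitrary choice; a real analysis would need an explicit tie-handling rule.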
Available data
connectivity
File format
BED
Data summary
The correlation (or anti-correlation) between histone marks at a putative enhancer and the expression of nearby genes was used to determine connectivity. This list was updated in 2014 based on the 2012 methodology. The first 3 columns are the coordinates of the distal regulatory module (DRM), the 4th column is the target gene, the 5th column is the correlation, and the 6th column is the histone mark (the annotation of the histone mark with high correlation).
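A minimal sketch of reading this six-column layout in Python is shown below. It assumes the file is tab-delimited, that the 5th column parses as a floating-point number, and that the 6th column holds a histone mark name; the `Connection` class, the `parse_connectivity` function, and the example line are hypothetical and not taken from the actual file.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Connection:
    chrom: str          # column 1: DRM chromosome
    start: int          # column 2: DRM start
    end: int            # column 3: DRM end
    target_gene: str    # column 4: target gene
    correlation: float  # column 5: correlation
    histone_mark: str   # column 6: histone mark annotation (assumed to be a name)


def parse_connectivity(lines: Iterable[str]) -> Iterator[Connection]:
    """Yield one Connection per data line, skipping blank and header-like lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        chrom, start, end, gene, corr, mark = fields[:6]
        yield Connection(chrom, int(start), int(end), gene, float(corr), mark)


# Made-up example line, for illustration only.
example = ["chr1\t1000\t1200\tGENE_A\t-0.82\tH3K27ac\n"]
rows = list(parse_connectivity(example))
```

In practice the function would be given an open file handle, for example `parse_connectivity(open(path))`, where `path` points at the downloaded BED file.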