RFECS: a random-forest based algorithm for enhancer identification from chromatin state.

Rajagopal N, Xie W, Li Y, Wagner U, Wang W, Stamatoyannopoulos J, Ernst J, Kellis M, Ren B.
PLoS computational biology. 2013;9(3):e1002968.
Abstract
Transcriptional enhancers play critical roles in regulation of gene expression, but their identification in the eukaryotic genome has been challenging. Recently, it was shown that enhancers in the mammalian genome are associated with characteristic histone modification patterns, which have been increasingly exploited for enhancer identification. However, only a limited number of cell types or chromatin marks have previously been investigated for this purpose, leaving the question unanswered whether there exists an optimal set of histone modifications for enhancer prediction in different cell types. Here, we address this issue by exploring genome-wide profiles of 24 histone modifications in two distinct human cell types, embryonic stem cells and lung fibroblasts. We developed a Random-Forest based algorithm, RFECS (Random Forest based Enhancer identification from Chromatin States) to integrate histone modification profiles for identification of enhancers, and used it to identify enhancers in a number of cell-types. We show that RFECS not only leads to more accurate and precise prediction of enhancers than previous methods, but also helps identify the most informative and robust set of three chromatin marks for enhancer prediction.

Related data

Available data
enhancer annotations
File format
TSV
Data summary
RFECS, a vector-based random forest algorithm, was originally trained on 24 chromatin marks in H1 and IMR90 cells. p300 binding sites that overlap DNase-I hypersensitive sites and are distal to transcription start sites (TSS) serve as the positive training set. TSS that overlap DNase-I hypersensitive sites, together with random 100 bp bins that are distal to known p300 binding sites or TSS, serve as the negative training set. The authors also showed that a minimal model trained only on the H3K4me1, H3K4me3, and H3K27ac marks was sufficiently accurate and could be used for enhancer prediction across cell lines; that is, a model trained on features from one cell line could be used to make predictions in another. RFECS using H3K4me1, H3K4me3, and H3K27ac was applied to 12 ENCODE cell lines.
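The training-set selection rules in the summary can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not the RFECS code: the function names, the half-open `(chrom, start, end)` interval convention, the `distal_bp` threshold of 2500 bp, and the `n_random` count are all assumptions introduced here, since the summary gives only the 100 bp bin size.

```python
import random
from typing import Dict, List, Tuple

# Half-open genomic interval: (chromosome, start, end). Assumed convention.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlaps_any(x: Interval, others: List[Interval]) -> bool:
    return any(overlaps(x, o) for o in others)


def is_distal(x: Interval, sites: List[Interval], distal_bp: int) -> bool:
    """True if no site lies within distal_bp of x (hypothetical threshold)."""
    padded = (x[0], x[1] - distal_bp, x[2] + distal_bp)
    return not overlaps_any(padded, sites)


def build_training_sets(
    p300: List[Interval],
    dnase: List[Interval],
    tss: List[Interval],
    chrom_sizes: Dict[str, int],
    distal_bp: int = 2500,   # assumed value, not given in the summary
    n_random: int = 100,     # assumed value, not given in the summary
    bin_size: int = 100,     # 100 bp bins, as stated in the summary
    seed: int = 0,
) -> Tuple[List[Interval], List[Interval]]:
    # Positives: p300 sites overlapping DNase-I sites and distal to TSS.
    positives = [
        p for p in p300
        if overlaps_any(p, dnase) and is_distal(p, tss, distal_bp)
    ]

    # Negatives, part 1: TSS overlapping DNase-I sites.
    neg_tss = [t for t in tss if overlaps_any(t, dnase)]

    # Negatives, part 2: random bins distal to both p300 sites and TSS.
    rng = random.Random(seed)
    chroms = sorted(chrom_sizes)
    neg_random: List[Interval] = []
    attempts = 0
    while len(neg_random) < n_random and attempts < 100 * n_random:
        attempts += 1
        chrom = rng.choice(chroms)
        raw = rng.randrange(0, chrom_sizes[chrom] - bin_size + 1)
        start = raw // bin_size * bin_size  # snap to a bin boundary
        candidate = (chrom, start, start + bin_size)
        if is_distal(candidate, p300, distal_bp) and is_distal(candidate, tss, distal_bp):
            neg_random.append(candidate)

    return positives, neg_tss + neg_random
```

The linear scans keep the sketch short; for genome-scale inputs an interval index or sorted sweep would be the more practical choice. The resulting positive and negative bins would then be paired with histone modification signal (for example H3K4me1, H3K4me3, and H3K27ac) as features for a classifier.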