The data sets were generated by the modENCODE (model organism Encyclopedia of DNA Elements) and ENCODE consortia in 2007-2012, funded by the National Human Genome Research Institute (NHGRI). Please see http://www.genome.gov/modencode/ and http://www.genome.gov/encode/ for more information about the projects.
These data consist of ChIP-seq and ChIP-chip profiles for histone modifications and chromosomal proteins in fly, worm, and human, as well as several related data sets. The ChIP-chip datasets were produced on Affymetrix (fly) or NimbleGen (worm) arrays. The ChIP-seq datasets were generated on Illumina sequencers.
Related papers and websites
For background on the ChIP-seq workflow by the consortium, please see Landt et al., ChIP-seq guidelines and practices used by the ENCODE and modENCODE consortia, Genome Research, 2012.
- ENCODE-X Browser: We have developed a web application for these chromatin data sets. Its main advantage is that it lets one quickly see which chromatin-related data are available using faceted browsing, and view the data in the IGV browser, for all three organisms.
- Antibody Validation Database: Antibodies used in the project were rigorously tested, and this database contains the validation data. Please see Egelhofer et al., An assessment of histone-modification antibody quality, Nature Structural & Molecular Biology, 2011.
- modENCODE Data Portal: This website also allows one to use faceted browsing to select datasets of interest (fly and worm only).
- modMine: This warehouse by the modENCODE Data Coordinating Center contains a flexible query interface with access to extensive intermediate and meta-data (fly and worm only).
- ENCODE data portal: This contains human and mouse ENCODE data.
- Gene Expression Omnibus (GEO) and Short Read Archive (SRA): Raw data are available from these two sites. Links to specific data sets are available from the above sites.
- ChIP-seq and ChIP-chip data
- Input normalized ChIP-seq and ChIP-chip fold enrichment profiles
- hiHMM chromatin state tracks
- Heterochromatin domains
- Lamina Associated Domains (LADs)
- Hi-C defined topological domains
- Chromatin-based inferred topological domains and their boundaries
- Gene annotation, gene expression data, and human-worm-fly ortholog map
- Genomic sequence mappability tracks
- Worm TSS definition based on capRNA-seq (capTSS)
- Other genomic features: GC-content and PhastCons scores
This table contains a complete listing and detailed meta-data for all chromatin data sets, including links to the source files.
The input normalized profiles are available in the ENCODE-X Browser. The procedure for normalization of these profiles is as follows:
To enable the cross-species comparisons described in this paper, we have processed all data uniformly across all species using MACS. For every pair of aligned ChIP and matching input-DNA data, we used MACS version 2 to generate fold enrichment signal tracks for every position in a genome:
macs2 callpeak -t ChIP.bam -c Input.bam -B --nomodel --shiftsize 73 --SPMR -g hs -n ChIP
macs2 bdgcmp -t ChIP_treat_pileup.bdg -c ChIP_control_lambda.bdg -o ChIP_FE.bedgraph -m FE
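To make the meaning of a fold enrichment (FE) track concrete, here is a minimal Python sketch of the underlying idea: at each position, the ChIP pileup is divided by the local control estimate (lambda). This is an illustration only, not MACS code. The function name and the `pseudocount` parameter are assumptions introduced for the example.

```python
def fold_enrichment(treat_pileup, control_lambda, pseudocount=0.0):
    """Return the per-position ratio of ChIP pileup to control lambda.

    Illustrative only: `pseudocount` is a hypothetical parameter added here
    to avoid division by zero; positions whose denominator is still zero
    are reported as NaN.
    """
    if len(treat_pileup) != len(control_lambda):
        raise ValueError("tracks must cover the same positions")
    ratios = []
    for treat, control in zip(treat_pileup, control_lambda):
        denominator = control + pseudocount
        if denominator > 0:
            ratios.append((treat + pseudocount) / denominator)
        else:
            ratios.append(float("nan"))
    return ratios


if __name__ == "__main__":
    # Two positions: one enriched over control, one depleted.
    print(fold_enrichment([4.0, 1.0], [2.0, 2.0]))
```

Values above 1 indicate positions where the ChIP signal exceeds the input-derived background.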
For the fly data, genomic DNA Tiling Arrays v2.0 (Affymetrix) were used to hybridize ChIP and input DNA. We obtained the log-intensity ratio values (M-values) for all perfect match (PM) probes: M = log2(ChIP intensity) - log2(input intensity), and performed a whole-genome baseline shift so that the mean of M in each microarray is equal to 0. The smoothed log-intensity ratios were calculated using LOWESS with a smoothing span corresponding to 500 bp, combining normalized data from two replicate experiments. For the worm data, a custom NimbleGen two-channel whole-genome microarray platform was used to hybridize both ChIP and input DNA. MA2C was used to preprocess the data to obtain a normalized and median-centered log2 ratio for each probe.
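The M-value calculation and baseline shift described for the fly arrays can be sketched as follows. This is a minimal illustration assuming one list of ChIP intensities and one list of input intensities per array, in matching probe order. The function name is hypothetical, and the LOWESS smoothing and replicate merging steps are not included.

```python
import math


def centered_m_values(chip_intensities, input_intensities):
    """Compute M = log2(ChIP) - log2(input) per probe, then shift so mean(M) is 0."""
    if len(chip_intensities) != len(input_intensities):
        raise ValueError("ChIP and input must have the same number of probes")
    m_values = [
        math.log2(chip) - math.log2(inp)
        for chip, inp in zip(chip_intensities, input_intensities)
    ]
    # Whole-array baseline shift: subtract the mean of M across all probes.
    mean_m = sum(m_values) / len(m_values)
    return [m - mean_m for m in m_values]


if __name__ == "__main__":
    print(centered_m_values([8, 4, 2], [2, 2, 2]))
```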
We performed joint chromatin state segmentation on the human, fly, and worm ChIP-seq histone modification data using a hierarchically linked infinite hidden Markov model (hiHMM). The software and associated documentation are accessible here.
- human (hg19) - H1-hESC
- human (hg19) - GM12878
- fly (dm3) - LE
- fly (dm3) - L3
- worm (ce10) - EE
- worm (ce10) - L3
To identify broad H3K9me3+ heterochromatin domains, we first identified broad H3K9me3 enrichment regions using SPP (Kharchenko et al.), applying the method get.broad.enrichment.cluster with a 10 kb window for fly and worm and a 100 kb window for human. Regions shorter than 10 kb were then removed, and the remaining regions were taken as the heterochromatin regions.
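The final length filter can be sketched in Python as below. The example assumes regions are given as (chromosome, start, end) tuples with half-open, BED-style coordinates. The function name is hypothetical, and the enrichment calling itself (done with SPP) is not reproduced.

```python
def filter_short_regions(regions, min_length=10_000):
    """Keep only regions at least `min_length` bp long.

    Assumes half-open (BED-style) coordinates, so length = end - start.
    """
    return [
        (chrom, start, end)
        for chrom, start, end in regions
        if end - start >= min_length
    ]


if __name__ == "__main__":
    candidates = [("chr2L", 0, 5_000), ("chr2L", 20_000, 45_000)]
    print(filter_short_regions(candidates))
```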
Genomic coordinates of LADs were directly obtained from their original publications, for worm (Ikegami et al.), fly (van Bemmel et al.) and human (Guelen et al.). We converted the genomic coordinates of LADs to ce10 (for worm), dm3 (for fly) and hg19 (for human) using UCSC’s liftOver tool with default parameters.
The data were downloaded from Dixon et al., “Topological domains in mammalian genomes identified by analysis of chromatin interactions”, Nature (2012) (for human embryonic stem cells) and Sexton et al., “Three-dimensional folding and functional organization principles of the Drosophila genome”, Cell (2012) (for fly late embryos). The human coordinates were originally in hg18. We used UCSC’s liftOver tool to convert the coordinates to hg19. Here are the genomic coordinates used in our study:
There are no known published Hi-C data for worm.
Based on the observation that each Hi-C-defined topological domain is usually uniformly enriched for similar chromatin states, we tested whether the correlation of histone modifications between different chromosomal regions (within each chromosome) could be used to infer topological domains and their boundaries. Here are the inferred domain definitions and boundaries:
- fly (dm3) LE: boundary score, boundary call, and domain call
- fly (dm3) L3: boundary score, boundary call, and domain call
- worm (ce10) EE: boundary score, boundary call, and domain call
- worm (ce10) L3: boundary score, boundary call, and domain call
Enhancers were identified using TSS-distal DHSs and p300 and CBP-1 binding sites. The positions listed in the files are a subset of TSS-distal DHSs (human, fly), p300 (human) and CBP-1 (worm) binding sites that are classified as enhancers.
- Enhancer Sites (tar.gz file)
The classification was optimized to obtain a high confidence set that is not necessarily very inclusive. For additional information please see the README file in the archive.
Data and the description of these data can be found at the modENCODE/ENCODE transcriptome page.
We generated empirical genomic sequence mappability tracks using input-DNA sequencing data. After merging input reads (up to 100 million), reads were extended to 149 bp, which corresponds to the shift of 74 bp in signal tracks. The union set of empirically mapped regions was obtained. They are available here:
In addition to these empirically derived genome-wide sequence mappability tracks, known unassembled genomic regions can be taken from the “Gap” table of the UCSC genome browser for comparison:
- human (hg19) (234 Mb of known unassembled regions)
- fly (dm3) (search for chr*_gap.txt.gz) (6.3 Mb of known unassembled regions)
There are no known unassembled regions in worm.
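The empirical mappability procedure described above (extend each input read to 149 bp, then take the union of covered regions) can be sketched as follows. This is a minimal illustration, not the pipeline actually used. It assumes reads are given as (chromosome, start, end, strand) tuples with half-open coordinates and that extension follows the read's strand. All function names are hypothetical.

```python
def extend_read(start, end, strand, length=149):
    """Extend a read to `length` bp from its 5' end, following its strand."""
    if strand == "+":
        return start, start + length
    # Minus-strand reads grow toward lower coordinates; clip at position 0.
    return max(0, end - length), end


def union_intervals(intervals):
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def mapped_regions(reads, length=149):
    """Return {chromosome: [(start, end), ...]} covered by extended reads."""
    by_chrom = {}
    for chrom, start, end, strand in reads:
        by_chrom.setdefault(chrom, []).append(extend_read(start, end, strand, length))
    return {chrom: union_intervals(ivs) for chrom, ivs in by_chrom.items()}


if __name__ == "__main__":
    reads = [("chr1", 100, 136, "+"), ("chr1", 200, 236, "+"), ("chr1", 1000, 1036, "-")]
    print(mapped_regions(reads))
```

Positions outside the resulting union never received an extended input read and can be treated as empirically unmappable.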
We obtained worm TSS definition based on capRNA-seq from Chen et al. “The landscape of RNA polymerase II transcription initiation in C. elegans reveals promoter and enhancer architectures”.
Briefly, short 5’-capped RNAs from total nuclear RNA of mixed-stage embryos were sequenced (i.e., capRNA-seq) on an Illumina GAIIA (SE36) with two biological replicates. Reads from capRNA-seq were mapped to the WS220 reference genome using BWA. Transcription initiation regions (TICs) were identified by clustering capRNA-seq reads. In this analysis we used TICs that overlap with WormBase TSSs within -199:100 bp. We refer to these capRNA-seq-defined TSSs as capTSS in this study.
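One way to read the -199:100 bp criterion is as a strand-aware window from 199 bp upstream to 100 bp downstream of an annotated TSS. The Python sketch below checks whether a TIC overlaps that window. This interpretation, the closed (1-based inclusive) coordinates, and the function names are assumptions made for illustration.

```python
def tss_window(tss, strand, upstream=199, downstream=100):
    """Return the closed window (low, high) around a TSS, respecting strand."""
    if strand == "+":
        return tss - upstream, tss + downstream
    # On the minus strand, upstream lies at higher coordinates.
    return tss - downstream, tss + upstream


def tic_overlaps_tss(tic_start, tic_end, tss, strand):
    """True if the closed interval [tic_start, tic_end] overlaps the TSS window."""
    low, high = tss_window(tss, strand)
    return tic_start <= high and tic_end >= low


if __name__ == "__main__":
    print(tss_window(1000, "+"))
    print(tic_overlaps_tss(1090, 1110, 1000, "+"))
```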
We downloaded the 5bp GC% data from the UCSC genome browser annotation download page for human (hg19), fly (dm3), and worm (ce10). Centering at every 5 bp bin, we calculated the running median of the GC% of the surrounding 100 bp (i.e., 105 bp in total). GC scores were then binned into 10 bp (fly and worm) or 50 bp (human) non-overlapping bins.
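The GC smoothing and binning can be sketched as below. The example assumes the input is a list of GC% values, one per consecutive 5 bp bin, so a window of 10 bins on each side of the center covers 105 bp. The source does not say how windows are handled at chromosome ends or how values are aggregated when re-binning, so truncating the window at the edges and averaging within each output bin are assumptions. Function names are hypothetical.

```python
from statistics import median


def running_median(values, half_window=10):
    """Running median over the center bin plus `half_window` bins on each side.

    With 5 bp bins and half_window=10 the window spans 21 bins (105 bp).
    Windows are truncated at the ends of the list (an assumption).
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        low = max(0, i - half_window)
        high = min(n, i + half_window + 1)
        smoothed.append(median(values[low:high]))
    return smoothed


def rebin_mean(values, factor):
    """Average consecutive groups of `factor` bins (2 for 10 bp, 10 for 50 bp)."""
    rebinned = []
    for i in range(0, len(values), factor):
        chunk = values[i:i + factor]
        rebinned.append(sum(chunk) / len(chunk))
    return rebinned


if __name__ == "__main__":
    gc_5bp = [40, 60, 20, 40, 80, 60]
    print(rebin_mean(running_median(gc_5bp, half_window=1), 2))
```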
PhastCons conservation score was obtained from the UCSC genome browser annotation download page. Specifically, we used the following score for each species.
PhastCons scores were then binned into 10 bp (fly and worm) or 50 bp (human) non-overlapping bins.