Chromatin

Ho et al., Comparative analysis of metazoan chromatin organization, Nature 512:449–452, 2014.

Genome function is dynamically regulated in part by chromatin, which consists of the histones, non-histone proteins and RNA molecules that package DNA. Studies in Caenorhabditis elegans (worm) and Drosophila melanogaster (fly) have contributed notably to our understanding of molecular mechanisms of genome function in humans, and have revealed conservation of chromatin components and mechanisms. Nevertheless, the three organisms have markedly different genome sizes, chromosome architecture and gene organization. On human and fly chromosomes, for example, pericentric heterochromatin flanks single centromeres, whereas worm chromosomes have dispersed heterochromatin-like regions enriched in the distal chromosomal ‘arms’, and centromeres distributed along their lengths. To systematically investigate chromatin organization and associated gene regulation across species, we generated and analysed a large collection of genome-wide chromatin data sets from cell lines and developmental stages in worm, fly and human. Here we present over 800 new data sets from our ENCODE and modENCODE consortia, bringing the total to over 1,400. Comparison of combinatorial patterns of histone modifications, nuclear lamina-associated domains, the organization of large-scale topological domains, the chromatin environment at promoters and enhancers, nucleosome positioning, and DNA replication patterns reveals many conserved features of chromatin organization among the three organisms. We also find notable differences in the composition and locations of repressive chromatin. These data sets and analyses provide a rich resource for comparative and species-specific investigations of chromatin composition, organization and function.

Overall Description

The data sets were generated by the modENCODE (model organism Encyclopedia of DNA Elements) and ENCODE consortia in 2007-2012, funded by the National Human Genome Research Institute (NHGRI). Please see http://www.genome.gov/modencode/ and http://www.genome.gov/encode/ for more information about the projects.

These data consist of ChIP-seq and ChIP-chip profiles for histone modifications and chromosomal proteins in fly, worm, and human, as well as several related data sets. The ChIP-chip data sets were produced on Affymetrix (fly) or NimbleGen (worm) arrays. The ChIP-seq data sets were generated on Illumina sequencers.

For background on the ChIP-seq workflow by the consortium, please see Landt et al., ChIP-seq guidelines and practices used by the ENCODE and modENCODE consortia, Genome Research, 2012.

ENCODE-X Browser: We have developed a web application for these chromatin data sets. Its main advantage is that faceted browsing lets one quickly see which chromatin-related data are available for all three organisms and view them in the IGV browser.
Antibody Validation Database: Antibodies used in the project were rigorously tested, and this database contains the validation data. Please see Egelhofer et al., An assessment of histone-modification antibody quality, Nature Structural & Molecular Biology, 2011.
modENCODE Data Portal: This website also allows one to use faceted browsing to select datasets of interest (fly and worm only).
modMine: This warehouse by the modENCODE Data Coordinating Center contains a flexible query interface with access to extensive intermediate and meta-data (fly and worm only).
ENCODE data portal: This contains human and mouse ENCODE data.
Gene Expression Omnibus (GEO) and Short Read Archive (SRA): Raw data are available from these two sites. Links to specific data sets are available from the above sites.

Available Data

ChIP-seq and ChIP-chip data
Input normalized ChIP-seq and ChIP-chip fold enrichment profiles
hiHMM chromatin state tracks
Heterochromatin domains
Lamina Associated Domains (LADs)
Hi-C defined topological domains
Chromatin-based inferred topological domains and their boundaries
Enhancers
Gene annotation, gene expression data, and human-worm-fly ortholog map
Genomic sequence mappability tracks
Worm TSS definition based on capRNA-seq (capTSS)
Other genomic features: GC-content and PhastCons scores

ChIP-seq and ChIP-chip data

This table contains a complete listing and detailed meta-data for all chromatin data sets, including links to the source files.

View as a Google Spreadsheet or download as an Excel file.

Input normalized ChIP-seq and ChIP-chip fold enrichment profiles

The input normalized profiles are available in the ENCODE-X Browser. The normalization procedure is as follows:

ChIP-seq

To enable the cross-species comparisons described in this paper, we have processed all data uniformly across all species using MACS. For every pair of aligned ChIP and matching input-DNA data, we used MACS version 2 to generate fold enrichment signal tracks for every position in a genome:

macs2 callpeak -t ChIP.bam -c Input.bam -B --nomodel --shiftsize 73 --SPMR -g hs -n ChIP

macs2 bdgcmp -t ChIP_treat_pileup.bdg -c ChIP_control_lambda.bdg -o ChIP_FE.bedgraph -m FE
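
As a rough illustration of what the fold enrichment (`-m FE`) step above produces, the sketch below divides a ChIP pileup by the matching control lambda at each position. It is not MACS code: the function name `fold_enrichment`, the list-based inputs and the `pseudocount` argument are assumptions made for this example, and real bedGraph files store runs of positions rather than one value per base.

```python
def fold_enrichment(treat_pileup, control_lambda, pseudocount=0.0):
    """Return the per-position ratio of ChIP pileup to control lambda.

    Both inputs are hypothetical per-position lists of equal length.
    Control lambda values are assumed to be positive; pass a non-zero
    pseudocount if that cannot be guaranteed.
    """
    if len(treat_pileup) != len(control_lambda):
        raise ValueError("tracks must cover the same positions")
    return [
        (t + pseudocount) / (c + pseudocount)
        for t, c in zip(treat_pileup, control_lambda)
    ]


if __name__ == "__main__":
    # Toy values only, not taken from any data set.
    print(fold_enrichment([2.0, 6.0, 1.0], [1.0, 2.0, 4.0]))
```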

ChIP-chip

For the fly data, genomic DNA Tiling Arrays v2.0 (Affymetrix) were used to hybridize ChIP and input DNA. We obtained the log-intensity ratio values (M-values) for all perfect match (PM) probes: M = log2(ChIP intensity) - log2(input intensity), and performed a whole-genome baseline shift so that the mean of M in each microarray is equal to 0. The smoothed log-intensity ratios were calculated using LOWESS with a smoothing span corresponding to 500 bp, combining normalized data from two replicate experiments. For the worm data, a custom NimbleGen two-channel whole-genome microarray platform was used to hybridize both ChIP and input DNA. MA2C was used to preprocess the data to obtain a normalized and median-centered log2 ratio for each probe.
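
The M-value definition and baseline shift described for the fly arrays can be sketched as follows. This is a minimal illustration with hypothetical function names and toy intensities; it omits the LOWESS smoothing and the combination of replicates, and it is not the consortium's pipeline.

```python
import math


def m_values(chip_intensity, input_intensity):
    """M = log2(ChIP intensity) - log2(input intensity) for each probe."""
    return [
        math.log2(c) - math.log2(i)
        for c, i in zip(chip_intensity, input_intensity)
    ]


def baseline_shift(m):
    """Shift all M-values so that their mean over the array is 0."""
    mean = sum(m) / len(m)
    return [v - mean for v in m]


if __name__ == "__main__":
    # Toy probe intensities, chosen so the arithmetic is easy to follow.
    m = m_values([8.0, 4.0, 2.0], [2.0, 2.0, 2.0])
    print(m)
    print(baseline_shift(m))
```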

hiHMM chromatin state tracks

We performed joint chromatin state segmentation on the human, fly, and worm ChIP-seq histone modification data using a hierarchically linked infinite hidden Markov model (hiHMM). The software and associated documentation are accessible here.

Heterochromatin domains

To identify broad H3K9me3+ heterochromatin domains, we first identified regions of broad H3K9me3 enrichment using the SPP method get.broad.enrichment.cluster (Kharchenko et al.), with a 10 kb window for fly and worm and a 100 kb window for human. Regions shorter than 10 kb were then removed, and the remaining regions were called heterochromatin.
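
The length filter in the last step can be sketched as below. The enrichment calling itself is done by SPP and is not reproduced here; the function name, the tuple layout and the half-open (BED-like) coordinates are assumptions made for this example.

```python
def filter_short_regions(regions, min_length=10_000):
    """Keep regions of at least min_length bp.

    regions is a hypothetical list of (chrom, start, end) tuples with
    half-open coordinates, so the length of a region is end - start.
    """
    return [
        (chrom, start, end)
        for chrom, start, end in regions
        if end - start >= min_length
    ]


if __name__ == "__main__":
    # Toy regions, not real domain calls.
    candidates = [("chr2L", 0, 9_999), ("chr2L", 50_000, 60_000)]
    print(filter_short_regions(candidates))
```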

Lamina Associated Domains (LADs)

Genomic coordinates of LADs were directly obtained from their original publications, for worm (Ikegami et al.), fly (van Bemmel et al.) and human (Guelen et al.). We converted the genomic coordinates of LADs to ce10 (for worm), dm3 (for fly) and hg19 (for human) using UCSC’s liftOver tool with default parameters.

Hi-C defined topological domains

The data were downloaded from Dixon et al., “Topological domains in mammalian genomes identified by analysis of chromatin interactions”, Nature (2012) (for human embryonic stem cells) and Sexton et al., “Three-dimensional folding and functional organization principles of the Drosophila genome”, Cell (2012) (for fly late embryos). The human coordinates were originally in hg18. We used UCSC’s liftOver tool to convert the coordinates to hg19. Here are the genomic coordinates used in our study:

There is no known published Hi-C data for worm.

Chromatin-based inferred topological domains and their boundaries

Based on the observation that each Hi-C-defined topological domain is usually uniformly enriched for similar chromatin states, we tested whether the correlation of histone modifications between different regions of the same chromosome could be used to infer topological domains and their boundaries. Here are the inferred domain definitions and boundaries.

fly (dm3) LE: boundary score, boundary call, and domain call
fly (dm3) L3: boundary score, boundary call, and domain call
worm (ce10) EE: boundary score, boundary call, and domain call
worm (ce10) L3: boundary score, boundary call, and domain call

Enhancers

Enhancers were identified using TSS-distal DHSs and p300 and CBP-1 binding sites. The positions listed in the files are a subset of TSS-distal DHSs (human, fly), p300 (human) and CBP-1 (worm) binding sites that are classified as enhancers.

Enhancer Sites (tar.gz file)

The classification was optimized to obtain a high confidence set that is not necessarily very inclusive. For additional information please see the README file in the archive.

Gene annotation, gene expression data, and human-worm-fly ortholog map

Data and the description of these data can be found at the modENCODE/ENCODE transcriptome page.

Genomic sequence mappability tracks

We generated empirical genomic sequence mappability tracks using input-DNA sequencing data. After merging up to 100M input reads, reads were extended to 149 bp, which corresponds to the shift of 74 bp in the signal tracks. The union set of empirically mapped regions was obtained. They are available here:

In addition to these empirically derived genome-wide sequence mappability tracks, known unassembled genomic regions can be obtained for comparison from the “Gap” table of the UCSC genome browser:

human (hg19) (234 Mb of known unassembled regions)
fly (dm3) (search for chr*_gap.txt.gz) (6.3 Mb of known unassembled regions)

There are no known unassembled regions in worm.
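
The read extension and union step used for the empirical mappability tracks can be sketched as follows. This is only an illustration: the function names, the (chrom, 5' position, strand) read layout, the half-open coordinates and the strand-aware extension are assumptions, not a description of the scripts actually used.

```python
def extend_reads(reads, length=149):
    """Extend each read to `length` bp from its 5' end, following its strand.

    reads is a hypothetical list of (chrom, five_prime_pos, strand) tuples.
    Returned intervals are half-open (start, end) and clipped at 0.
    """
    extended = []
    for chrom, pos, strand in reads:
        if strand == "+":
            start, end = pos, pos + length
        else:
            start, end = max(0, pos - length + 1), pos + 1
        extended.append((chrom, start, end))
    return extended


def union_intervals(intervals):
    """Merge overlapping or touching intervals on each chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


if __name__ == "__main__":
    # Toy reads, not real alignments.
    toy_reads = [("chr1", 100, "+"), ("chr1", 200, "+"), ("chr1", 1000, "-")]
    print(union_intervals(extend_reads(toy_reads)))
```

Positions outside the resulting union are those never covered by an extended input read, which is the sense in which the track is empirical.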

Worm TSS definition based on capRNA-seq (capTSS)

We obtained worm TSS definition based on capRNA-seq from Chen et al. “The landscape of RNA polymerase II transcription initiation in C. elegans reveals promoter and enhancer architectures”.

Briefly, short 5’-capped RNA from total nuclear RNA of mixed-stage embryos was sequenced (i.e., capRNA-seq) on an Illumina GAIIA (SE36) with two biological replicates. Reads from capRNA-seq were mapped to the WS220 reference genome using BWA. Transcription initiation regions (TICs) were identified by clustering of capRNA-seq reads. In this analysis we used TICs that overlap with WormBase TSSs within -199:100 bp. We refer to these capRNA-seq defined TSSs as capTSS in this study.
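
The overlap criterion can be sketched as below, reading "-199:100 bp" as a strand-aware window from 199 bp upstream to 100 bp downstream of the annotated TSS. That reading, the inclusive coordinates and the function names are assumptions made for this illustration.

```python
def tss_window(tss, strand, upstream=199, downstream=100):
    """Return the inclusive (low, high) window around a TSS on its strand."""
    if strand == "+":
        return tss - upstream, tss + downstream
    return tss - downstream, tss + upstream


def tic_overlaps_tss(tic_start, tic_end, tss, strand):
    """True if the inclusive TIC interval overlaps the TSS window."""
    low, high = tss_window(tss, strand)
    return tic_start <= high and tic_end >= low


if __name__ == "__main__":
    # Toy coordinates, not real annotations.
    print(tss_window(1000, "+"))
    print(tic_overlaps_tss(795, 805, 1000, "+"))
```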

Other genomic features: GC-content and PhastCons scores

GC content

We downloaded the 5bp GC% data from the UCSC genome browser annotation download page for human (hg19), fly (dm3), and worm (ce10). Centering at every 5 bp bin, we calculated the running median of the GC% of the surrounding 100 bp (i.e., 105 bp in total). GC scores were then binned into 10 bp (fly and worm) or 50 bp (human) non-overlapping bins.
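
The smoothing and binning described above can be sketched as follows. A 105 bp window corresponds to 21 consecutive 5 bp bins (the central bin plus 10 on each side). How window edges are handled and how values are combined during re-binning are not stated in the text, so the truncated edge windows and the averaging used here are assumptions, as are the function names.

```python
from statistics import median


def running_median(values, half_window=10):
    """Median over the current bin and up to half_window bins on each side.

    With 5 bp bins and half_window=10 this spans 105 bp. Windows are
    truncated at the ends of the track (an assumption for this sketch).
    """
    n = len(values)
    return [
        median(values[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]


def rebin(values, factor):
    """Average consecutive groups of `factor` bins into non-overlapping bins.

    factor=2 turns 5 bp bins into 10 bp bins; factor=10 gives 50 bp bins.
    """
    groups = (values[i:i + factor] for i in range(0, len(values), factor))
    return [sum(group) / len(group) for group in groups]


if __name__ == "__main__":
    # Toy GC% values for consecutive 5 bp bins.
    gc_5bp = [40, 60, 40, 60, 40, 60]
    print(rebin(running_median(gc_5bp, half_window=1), 2))
```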

PhastCons scores

PhastCons conservation scores were obtained from the UCSC genome browser annotation download page. Specifically, we used the following scores for each species.

PhastCons scores were then binned into 10 bp (fly and worm) or 50 bp (human) non-overlapping bins.
