Common File Formats Used by the ENCODE Consortium

Overview

The ENCODE consortium uses several file formats to store, display, and disseminate data:

  • FASTQ: The FASTQ format is a text-based format for storing nucleotide sequences (reads) and their quality scores. [1]
  • BAM: The Sequence Alignment/Map (SAM) format is a text-based format for storing read alignments against reference sequences; it is interconvertible with the binary BAM format. [2]
  • bigWig: The bigWig format is an indexed binary format for rapid display of continuous and dense data in the UCSC Genome Browser.
  • bigBed: The bigBed format is also an indexed binary format for rapid display of annotation items such as a linked collection of exons or the binding peaks of a transcription factor.
  • hic: The hic format is a binary format for storing contact matrices and annotations of chromatin structural features generated from Hi-C or other proximity mapping assays.

These file formats were originally designed to be generic and flexible. Because ENCODE is a collaborative effort, the consortium has added several specifications to these formats to facilitate data archival, presentation, and distribution, as well as integrative analysis of the data. The consortium treats FASTQ as the basic file format for archival purposes, so the FASTQ specifications aim to preserve the raw sequence data. The other file formats are geared towards data visualization and dissemination, so their specifications aim to make the data easy to use. Additional information about the file formats can be found at the UCSC Genome Browser ENCODE-specific File Formats.

FASTQ

FASTQ file content

FASTQ files are submitted as they come off the sequencing instrument so that downstream users retain as many processing choices as possible. The files are accompanied by documentation detailing how the sequencing libraries were constructed. This documentation informs end users about how they might want to process the data, the strengths and limitations of the various processing options, and how these apply to the user's biological questions of interest.

ENCODE produces replicate data for most experiments to quantify reliability. Biological replicates involve different biological samples, e.g., different tissue preparations, or separate rounds of cell growth and expansion when cell lines are used. Biological replicates are contrasted with technical replicates, for which different sequencing libraries are prepared from the same sample, or the same library is sequenced on different lanes. Reads from different replicates are stored in separate files and should include flow cell and lane ID. If multiple lanes are used for the same biological or technical replicate, the reads are stored in the same file (after a QC check to eliminate failed lanes), with information on flow cell and lane ID included. For experiments that produce paired-end reads, the two reads in each pair are stored in two separate files, with the reads in the same order in the two files.
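The requirement that paired-end files list their reads in the same order can be checked mechanically. The following is a minimal sketch, not a consortium tool: the function names are hypothetical, and it assumes four-line FASTQ records whose mates share a read name once an optional trailing /1 or /2 and any comment after whitespace are removed.

```python
from itertools import zip_longest
from typing import Iterable, Iterator


def read_names(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the read name from the first line of each four-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:
            continue  # only the first line of each record carries the name
        header = line.strip()
        if not header:
            continue  # tolerate a trailing blank line
        if not header.startswith("@") or len(header) < 2:
            raise ValueError(f"not a FASTQ header line: {header!r}")
        # Drop the leading '@' and anything after the first whitespace.
        name = header[1:].split()[0]
        # Assumption: mates may be distinguished by a /1 or /2 suffix.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def pairs_in_same_order(r1_lines: Iterable[str], r2_lines: Iterable[str]) -> bool:
    """Return True if both files contain the same read names in the same order."""
    pairs = zip_longest(read_names(r1_lines), read_names(r2_lines))
    return all(a == b for a, b in pairs)
```

Because `zip_longest` pads the shorter input with `None`, files with different numbers of reads are reported as out of sync rather than silently truncated.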

The reads in FASTQ files are unfiltered: barcodes that are read out as part of the sequence, adapter sequences, and spike-in reads all remain in the files. The exception is Illumina barcodes in the so-called third read position, which should not be present in the sequence. For bisulfite sequencing experiments, the raw FASTQ files are presented, wherein most unmethylated cytosines are converted to thymines.

Reads are not "clipped" (no bases are removed). For example, in the case of small RNAs that are shorter than the read length, there may be adapters flanking these reads; these adapter sequences remain in the FASTQ file. Some libraries are constructed in such a way that the barcode is read out in the sequence (CSHL small RNAs were made this way during phase II of ENCODE), and the barcode will appear in the FASTQ. Even though these barcodes need to be trimmed off prior to mapping, they are still included in the FASTQ file because different users may choose different trimming algorithms.

FASTQ sequencing quality

FASTQ uses four lines for each sequence, with the fourth line denoting the sequencing quality at each position. The consortium reports Phred quality scores from 0 to 93 using ASCII characters 33 to 126, i.e., the Phred score plus 33. This encoding is used by the newest versions of the Illumina pipeline, Sanger, and SRA. The Phred score of a base [3][4] is defined as Q = -10 log10(e), where e is the estimated probability that the base call is erroneous.
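The encoding above is easy to apply by hand, and a short sketch may make it concrete. The function names below are illustrative only and are not part of any ENCODE software; the code simply subtracts 33 from each ASCII code and inverts the Phred definition.

```python
import math

PHRED_OFFSET = 33  # Phred+33: ASCII code = Phred score + 33
MAX_PHRED = 93     # ASCII 126 is the highest character used


def decode_quality(quality_line: str) -> list[int]:
    """Convert the fourth line of a FASTQ record into Phred scores."""
    scores = [ord(ch) - PHRED_OFFSET for ch in quality_line]
    if any(q < 0 or q > MAX_PHRED for q in scores):
        raise ValueError("quality string is not valid Phred+33")
    return scores


def error_probability(phred: float) -> float:
    """Invert Q = -10 * log10(e) to recover the error probability e."""
    return 10 ** (-phred / 10)


def phred_score(e: float) -> float:
    """Compute Q = -10 * log10(e) for an error probability e in (0, 1]."""
    return -10 * math.log10(e)
```

For example, the character `I` has ASCII code 73, so it encodes a Phred score of 40, which corresponds to an estimated error probability of 1 in 10,000.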

Introductory information on the FASTQ format

BAM

BAM file content

When sequence reads are mapped to reference sequences, the resulting alignments are stored in BAM files (SAMtools is used to convert between SAM and BAM files). Mapping algorithms (e.g., Bowtie, BWA, STAR) use many parameters, such as the version of the reference genome, the total number of mismatches allowed during mapping, and the maximal number of times a read is allowed to map to the reference. Furthermore, the content may change when SAMtools converts a SAM file to a BAM file. For example, the user may allow both unique-mapping and multiple-mapping reads during mapping and then decide to retain only unique-mapping reads in the BAM file. Therefore, the consortium documents the parameters used by the mapping algorithm and SAMtools in the header of the BAM files.
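As a rough illustration of how these documented parameters might be recovered, the sketch below parses program records from a text header (such as the output of `samtools view -H`). It assumes the parameters are recorded in @PG lines, which is where the SAM specification places program names, versions, and command lines; this document does not state which header record ENCODE files use. The function name and the example header values are hypothetical.

```python
def program_records(sam_header: str) -> list[dict[str, str]]:
    """Return one dict of TAG -> VALUE for each @PG line in a SAM header."""
    records = []
    for line in sam_header.splitlines():
        if not line.startswith("@PG"):
            continue
        fields = line.split("\t")[1:]
        # Split on the first colon only: values such as CL may contain colons.
        records.append(dict(field.split(":", 1) for field in fields))
    return records


# Illustrative header only; real ENCODE headers will differ.
EXAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@PG\tID:bowtie2\tPN:bowtie2\tVN:2.3.4\tCL:bowtie2 -x hg38 -k 1 -U reads.fq\n"
    "@PG\tID:samtools\tPN:samtools\tPP:bowtie2\tCL:samtools view -b -q 30 in.sam\n"
)

for record in program_records(EXAMPLE_HEADER):
    print(record["ID"], "->", record.get("CL", "(no command line recorded)"))
```

The PP tag, when present, names the previous program in the chain, so the order of processing steps can be reconstructed from the records.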

For experiments that generate paired-end reads, the paired reads are stored in the same BAM file. The consortium also retains unmapped reads and spike-ins (whenever appropriate). Because spike-in reads are "non-chromosomal", they need to be filtered out before downstream processing. The quality scores for unmapped reads are stored in the same format as in FASTQ files, i.e., Phred+33. Biological replicates are stored in separate BAM files. Multiple lanes of the same library are pooled into a single BAM, with read names containing lane information so that it is possible to decompose the pooled BAM file into individual BAM files by lane.
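Decomposing a pooled file by lane depends on how lane information is embedded in the read names, which this document does not specify. The sketch below assumes Illumina-style names of the form instrument:run:flowcell:lane:tile:x:y and works on text SAM records rather than binary BAM; both the name layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable


def lane_key(read_name: str) -> str:
    """Return 'flowcell:lane' from an assumed Illumina-style read name."""
    parts = read_name.split(":")
    if len(parts) < 4:
        raise ValueError(f"read name has no flow cell and lane fields: {read_name!r}")
    return f"{parts[2]}:{parts[3]}"


def split_by_lane(sam_records: Iterable[str]) -> dict[str, list[str]]:
    """Group SAM alignment lines by the flow cell and lane in their read names."""
    by_lane: dict[str, list[str]] = defaultdict(list)
    for record in sam_records:
        if record.startswith("@"):
            continue  # header lines carry no read name
        read_name = record.split("\t", 1)[0]
        by_lane[lane_key(read_name)].append(record)
    return dict(by_lane)
```

Each value in the returned dictionary holds the records for one lane, in their original order, and could be written out with the original header to form a per-lane file.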

At the present time, the consortium releases only one BAM file for each FASTQ file (or for each pair of FASTQ files in the case of paired-end datasets). However, in the future the consortium may allow multiple BAM files for the same FASTQ file, potentially for different mapping algorithms (e.g., Bowtie and STAR for RNA-seq data) or for mapping to different reference genomes (e.g., personalized genomes). In that case, the consortium will provide clear guidelines for usage.

BAM mapping parameters

Because ENCODE uses diverse data types, the choice of mapping algorithm and its parameters depends on the data type. These parameters include how many mismatches are allowed, whether seed matching is used (only the prefix of each read is used for mapping while the low-quality suffix is discarded), whether reads that map to many locations in the reference are allowed, etc. The consortium aims to specify the settings of these parameters for each individual data type, and these specifications will be released in future versions of this document. Nonetheless, the settings of all tunable parameters are specified in the header of each BAM file.

Introductory information on the SAM/BAM format

bigWig

bigWig file content

To visualize the number of reads mapped to a reference genome as a continuous signal in the UCSC Genome Browser, a user can convert a BAM file to a bigWig file (via the intermediate bedGraph format, using computer programs provided by the UCSC Genome Browser).
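To show what the intermediate bedGraph step represents, the sketch below turns aligned intervals on one chromosome into bedGraph rows of read depth. It is a simplified illustration, not the consortium's pipeline: the function name is hypothetical, intervals are assumed to be 0-based and half-open, and spliced alignments, strand, and normalization are ignored. The binary conversion itself is left to the UCSC utilities (for example, bedGraphToBigWig).

```python
from collections import Counter
from typing import Iterable


def coverage_bedgraph(
    chrom: str, intervals: Iterable[tuple[int, int]]
) -> list[tuple[str, int, int, int]]:
    """Return bedGraph rows (chrom, start, end, depth) for aligned intervals."""
    # Record +1 where a read starts and -1 where it ends.
    delta: Counter[int] = Counter()
    for start, end in intervals:
        delta[start] += 1
        delta[end] -= 1

    rows = []
    depth = 0
    run_start = None
    for pos in sorted(delta):
        if delta[pos] == 0:
            continue  # depth does not change here, so the current run continues
        if run_start is not None and depth > 0:
            rows.append((chrom, run_start, pos, depth))
        depth += delta[pos]
        run_start = pos
    return rows


# Two overlapping reads give three runs with depths 1, 2, and 1.
for row in coverage_bedgraph("chr1", [(0, 10), (5, 15)]):
    print("\t".join(map(str, row)))
```

Regions with zero depth are simply omitted from the output, which bedGraph allows.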

Stranded data are stored in two bigWig files, one file for the plus genomic strand and the other file for the minus strand. The data on the two strands are displayed as two separate UCSC tracks by default and can also be displayed in different colors as a single overlaid track (without changing the two bigWig files). For unstranded data, signals on the plus and minus strands are summed and only one bigWig file is needed.

Data from biological replicates are stored in individual bigWig files and can be viewed as separate UCSC tracks; however, this may cease to be necessary once the user has concluded that the replicates are highly reproducible. Thus the consortium also provides one bigWig file for each experiment, with the reads from all biological replicates pooled, and uses this file to define the default track for the experiment.

Generation of bigWig files

To facilitate comparison across datasets, ENCODE bigWig files are generated automatically by the ENCODE uniform processing pipelines, which apply appropriate parameters for data normalization and filtering. The version and key parameters of the pipeline used to generate a bigWig file are provided.

Introductory information on the bigWig format

bigBed

bigBed file content

Analyses of ENCODE data produce annotation files, e.g., genomic regions that are enriched in ChIP-seq signal of transcription factors (ChIP-seq peaks), splice junctions detected using RNA-seq data, or differentially methylated regions detected using bisulfite sequencing data. Such annotation files can be visualized in the UCSC Genome Browser using the bigBed format. Related to and interconvertible with the text-based BED format, bigBed is an indexed binary format designed for rapid visualization. For each element in an ENCODE bigBed file, the consortium specifies its chromosome, start, end, genomic strand when applicable, and a color score that denotes average signal enrichment for the region.
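The fields listed above map naturally onto the first six columns of a text BED line, from which a bigBed file can be built. The sketch below is illustrative only: the class name is hypothetical, and it assumes the standard BED conventions of 0-based half-open coordinates, a score between 0 and 1000, and "." for an element with no strand. How ENCODE scales signal enrichment into the score is not described here and is not modeled.

```python
from dataclasses import dataclass


@dataclass
class BedElement:
    chrom: str
    start: int          # 0-based, inclusive
    end: int            # 0-based, exclusive
    name: str = "."
    score: int = 0      # BED scores range from 0 to 1000
    strand: str = "."   # "+", "-", or "." when strand does not apply

    def __post_init__(self) -> None:
        if not 0 <= self.start < self.end:
            raise ValueError("start must be non-negative and less than end")
        if not 0 <= self.score <= 1000:
            raise ValueError("score must be between 0 and 1000")
        if self.strand not in {"+", "-", "."}:
            raise ValueError("strand must be '+', '-', or '.'")

    def to_bed(self) -> str:
        """Format the element as a tab-separated six-column BED line."""
        fields = (self.chrom, self.start, self.end, self.name, self.score, self.strand)
        return "\t".join(str(f) for f in fields)


# A hypothetical ChIP-seq peak on the plus strand.
print(BedElement("chr1", 100, 200, "peak1", 850, "+").to_bed())
```

Validating in `__post_init__` means a malformed element is rejected when it is created rather than when the file is later converted.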

Introductory information on the bigBed format

hic

hic file content

Proximity mapping assays, like Hi-C, measure long-range chromatin interactions between pairs of loci in the genome. Reads are sequenced and aligned to a reference genome using the Juicer and hictools software to generate hic files, which are highly compressed binary files containing the contact matrices and normalization vectors at several different resolutions. These files include annotations of chromatin structural features, including loops, loop anchor motifs, contact domains, and compartments [5]. On the ENCODE portal, hic files are visualized using the Juicebox software. See ENCSR545YBD for an example.

Introductory information on the hic format

References

[1] Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010 Apr;38(6):1767-71. PMID: 20015970; PMC: PMC2847217

[2] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. PMID: 19505943; PMC: PMC2723002

[3] Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998 Mar;8(3):175-85. PMID: 9521921

[4] Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998 Mar;8(3):186-94. PMID: 9521922

[5] Durand NC, Shamim MS, Machol I, Rao SSP, Huntley MH, Lander ES, Aiden EL. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Systems. 2016 Jul;3:95-98. PMID: 27467249