Whole-Genome Bisulfite Sequencing Data Standards and gemBS-based Processing Pipeline
Whole-genome bisulfite sequencing (WGBS) is used to investigate DNA methylation patterns at single-base resolution. Bisulfite treatment converts unmethylated cytosines into uracils but leaves methylated cytosines unchanged. After mapping bisulfite sequencing reads against a gemBS-transformed genome, the pipeline extracts the CpG, CHG, and CHH methylation patterns genome-wide. CpG sites are palindromic, consisting of a cytosine nucleotide followed by a guanine, and are highly prone to methylation. Methylation increases the likelihood of mutation, because a methylated cytosine can be deaminated to thymine. Methylation can also occur at CHG and CHH sites, where "H" is A, C, or T (Guo et al. Characterizing the strand-specific distribution of non-CpG methylation in human pluripotent cells, Nucleic Acids Res (2014) 42 (5): 3009-3016).
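The three sequence contexts and the effect of bisulfite treatment can be illustrated with a small sketch. This is not pipeline code: the function names `cytosine_context` and `bisulfite_convert` are hypothetical, only the plus strand is considered, and the converted uracil is written as T on the assumption that it is read as thymine after amplification and sequencing.

```python
from typing import Optional, Set

H_BASES = {"A", "C", "T"}  # "H" stands for any base except G


def cytosine_context(seq: str, i: int) -> Optional[str]:
    """Classify the cytosine at 0-based index i of a plus-strand sequence."""
    seq = seq.upper()
    if seq[i] != "C":
        return None
    downstream = seq[i + 1 : i + 3]
    if downstream[:1] == "G":
        return "CpG"
    if len(downstream) < 2 or downstream[0] not in H_BASES:
        return None  # too close to the sequence end, or an ambiguous base
    if downstream[1] == "G":
        return "CHG"
    if downstream[1] in H_BASES:
        return "CHH"
    return None


def bisulfite_convert(seq: str, methylated: Set[int]) -> str:
    """In-silico conversion: every unmethylated C is written as T."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq.upper())
    )
```

For example, `cytosine_context("CAG", 0)` classifies the first base as CHG, and `bisulfite_convert("ACGCT", {1})` leaves the methylated cytosine at index 1 intact while converting the one at index 3.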
Updated May 2021
The WGBS pipeline was developed as part of the ENCODE Uniform Processing Pipelines series [1]. The full WGBS pipeline code is available on GitHub and can be run on various platforms via Caper. It uses gemBS [2] for alignment and methylation extraction.
View the current schematic
Information contained in file
|Gzipped DNA-sequencing reads||Reads must meet the criteria outlined under the Uniform Processing Pipeline Restrictions.|
|tar||genome index||Collection of index files used by gemBS [2]|
|tsv||chromosome sizes||Chromosome sizes for the mapping assembly|
Information contained in file
Produced by mapping reads to the genome
|bigWig||CpG sites coverage||Read coverage at CpG sites|
|bigWig||plus strand methylation state at CpG||Percent methylation at plus-strand CpG sites|
|bigWig||minus strand methylation state at CpG||Percent methylation at minus-strand CpG sites|
|bed9+ and bigBed9+||methylation state at CpG||Percent methylation at CpG sites (see Description of bed9+ file below this table).||CpG is a sequence in which a cytosine is followed by a guanine.|
|bed9+ and bigBed9+||methylation state at CHG||Percent methylation at CHG sites (see Description of bed9+ file below this table).||CHG is a sequence in which a cytosine and a guanine are separated by an adenine, a cytosine, or a thymine.|
|bed9+ and bigBed9+||methylation state at CHH||Percent methylation at CHH sites (see Description of bed9+ file below this table).||CHH is a sequence in which a cytosine is followed by two nucleotides, each of which may be adenine, cytosine, or thymine.|
Quality control metrics are collected from the Samtools and gemBS software suites, and Pearson correlation is calculated from the two replicates' methylation states at CpG.
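As a rough illustration of the replicate comparison, the sketch below computes a Pearson correlation over CpG sites covered in both replicates. It is not the pipeline's implementation: the function name `cpg_pearson`, the input layout (a dict mapping a site key to a `(coverage, percent_methylation)` pair), and the choice to require the minimum coverage in both replicates are all assumptions.

```python
from math import sqrt
from typing import Dict, Hashable, Tuple

Site = Dict[Hashable, Tuple[int, float]]  # site -> (coverage, percent methylation)


def cpg_pearson(rep1: Site, rep2: Site, min_cov: int = 10) -> float:
    """Pearson correlation of methylation at sites well covered in both replicates."""
    shared = [
        site
        for site in rep1.keys() & rep2.keys()
        if rep1[site][0] >= min_cov and rep2[site][0] >= min_cov
    ]
    if len(shared) < 2:
        raise ValueError("need at least two shared sites above the coverage cutoff")
    x = [rep1[site][1] for site in shared]
    y = [rep2[site][1] for site in shared]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one replicate is constant")
    return sxy / sqrt(sxx * syy)
```

Sites that fall below the cutoff in either replicate are dropped before the correlation is computed, so a low-coverage outlier does not affect the result.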
Description of bed9+ file
The bed9+ file contains the number of reads and the percent methylation. The first 11 columns correspond to the ENCODE bedMethyl file format and the last 3 contain information about the genotype at the position.
Each column represents the following:
- Reference chromosome or scaffold
- Start position in chromosome
- End position in chromosome
- Name of item
- Score from 0 to 1000: the number of reads, capped at 1000
- Strandedness, plus (+), minus (-), or unknown (?)
- Start of where display should be thick
- End of where display should be thick
- Color value (RGB)
- Coverage, or number of reads
- Percentage of reads that show methylation at this position in the genome
- Reference genotype
- Sample genotype
- Quality score for genotype call
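The column layout above can be read into a typed record with a few lines of Python. This is a minimal sketch, not part of the pipeline: the names `MethylRecord` and `parse_bed9_plus` are hypothetical, the file is assumed to be tab-delimited with exactly the 14 columns listed, and the three genotype columns are kept as strings because their exact encoding is not described here.

```python
from typing import NamedTuple


class MethylRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    thick_start: int
    thick_end: int
    rgb: str
    coverage: int
    pct_methylated: float
    ref_genotype: str
    sample_genotype: str
    genotype_quality: str


def parse_bed9_plus(line: str) -> MethylRecord:
    """Split one tab-delimited bed9+ line into the 14 documented columns."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 14:
        raise ValueError(f"expected 14 columns, found {len(f)}")
    return MethylRecord(
        f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
        int(f[6]), int(f[7]), f[8], int(f[9]), float(f[10]),
        f[11], f[12], f[13],
    )
```

A record parsed this way exposes, for instance, `rec.coverage` and `rec.pct_methylated`, which is convenient when filtering sites by read depth.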
View the mapping assemblies (includes lambda genome for generation of comparative statistics)
View the gemBS reference files used in this pipeline
Links and Publications
Find data generated by the WGBS pipeline
Explore WGBS-related publications on the portal here
1. Tsuji, Junko, and Zhiping Weng. "Evaluation of preprocessing, mapping and postprocessing algorithms for analyzing whole genome bisulfite sequencing data." Briefings in Bioinformatics 17.6 (2015): 938-952.
2. Merkel, Angelika et al. "gemBS: high throughput processing for DNA methylation data from bisulfite sequencing." Bioinformatics 35.5 (2019): 737-742.
Uniform Processing Pipeline Restrictions
- Each replicate should have 30X coverage.
- The read length should be a minimum of 100 base pairs.
- Sequencing may be paired- or single-ended, as long as sequencing type is specified and paired sequences are indicated.
- The sequencing platform must be indicated.
- Barcodes, if present in fastq, must be indicated.
- The pipeline maps against the lambda genome as a method of control.
- Reads are mapped to either the GRCh38 or mm10 assembly.
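A quick way to judge whether a replicate is likely to reach the 30X target is the usual depth estimate, coverage = reads × read length / genome size. The sketch below is only a back-of-the-envelope check: `estimate_coverage` is a hypothetical helper, the genome size in the example is an approximate figure supplied by the caller, and the estimate ignores unmapped, duplicate, and trimmed reads, so it is not how the pipeline measures coverage.

```python
def estimate_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth if every sequenced base mapped uniquely to the genome."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


# Example: 930 million 100 bp reads against a genome of roughly 3.1 Gb
# (an approximate size used here only for illustration).
depth = estimate_coverage(930_000_000, 100, 3_100_000_000)
meets_target = depth >= 30
```

Because real experiments lose reads to filtering and duplication, the sequenced depth generally needs to exceed this estimate to deliver 30X of usable coverage.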
Experimental guidelines for WGBS experiments can be found here.
Experiments should have two or more biological replicates; they may have two technical replicates per biological replicate. Assays performed using EN-TEx samples may be exempted due to limited availability of experimental material.
- The C to T conversion rate should be ≥98%.
- The CpG quantification should have a Pearson correlation of ≥0.8 for sites with ≥10X coverage.
- Sequencing may be paired- or single-ended, as long as sequencing type is specified and paired sequences are indicated.
- The experiment must pass routine metadata audits in order to be released.
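The two numeric thresholds above can be expressed as a simple check. This is a sketch under stated assumptions rather than the portal's audit logic: `conversion_rate` and `passes_qc` are hypothetical names, and the conversion rate is assumed to be the fraction of converted cytosine calls among all cytosine calls on an unmethylated control such as the lambda genome.

```python
def conversion_rate(converted: int, unconverted: int) -> float:
    """Fraction of cytosine calls on the control that were read as converted."""
    total = converted + unconverted
    if total == 0:
        raise ValueError("no cytosine calls on the control")
    return converted / total


def passes_qc(
    conv_rate: float,
    pearson_r: float,
    min_conv: float = 0.98,  # C to T conversion rate threshold
    min_r: float = 0.8,      # replicate CpG Pearson correlation threshold
) -> bool:
    """True when both numeric thresholds listed above are met."""
    return conv_rate >= min_conv and pearson_r >= min_r
```

For instance, `passes_qc(conversion_rate(990, 10), 0.85)` is true, whereas a conversion rate of 0.97 fails regardless of the correlation.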