Whole-Genome Bisulfite Sequencing Data Standards and gemBS-based Processing Pipeline

Assay Overview

Whole-Genome Bisulfite Sequencing (WGBS) is used to investigate DNA methylation patterns at single-base resolution. Bisulfite treatment converts unmethylated cytosines into uracils but leaves methylated cytosines unchanged. After mapping bisulfite sequencing reads against a gemBS-transformed genome, the pipeline extracts the CpG, CHG, and CHH methylation patterns genome-wide. CpG sites are palindromic, consisting of a cytosine nucleotide followed by a guanine, and are highly prone to methylation. Methylation increases the likelihood of a mutation in which the cytosine is deaminated and becomes thymine. Methylation can also occur at CHG and CHH sites, where "H" is A, C, or T (Guo et al. Characterizing the strand-specific distribution of non-CpG methylation in human pluripotent cells, Nucleic Acids Res (2014) 42 (5): 3009-3016).
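The sequence contexts and the effect of bisulfite treatment described above can be illustrated with a short Python sketch. It is not part of the pipeline or of gemBS; the function names are hypothetical, only the forward strand is considered, and the conversion is idealized (every unmethylated cytosine is read as T, every methylated cytosine stays C).

```python
def cytosine_contexts(seq):
    """Return (position, context) for each forward-strand cytosine.

    Context is "CpG", "CHG", or "CHH", where H is A, C, or T.
    It is None when the context cannot be determined (sequence end or N).
    """
    seq = seq.upper()
    result = []
    for i, base in enumerate(seq):
        if base != "C":
            continue
        following = seq[i + 1:i + 3]  # up to two downstream bases
        if len(following) >= 1 and following[0] == "G":
            context = "CpG"
        elif len(following) == 2 and following[0] in "ACT" and following[1] == "G":
            context = "CHG"
        elif len(following) == 2 and following[0] in "ACT" and following[1] in "ACT":
            context = "CHH"
        else:
            context = None
        result.append((i, context))
    return result


def bisulfite_convert(seq, methylated_positions):
    """Idealized conversion: unmethylated C is read as T after sequencing."""
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq.upper())
    )


if __name__ == "__main__":
    example = "CGCAGCTT"  # made-up sequence for illustration
    print(cytosine_contexts(example))
    print(bisulfite_convert(example, methylated_positions={0}))
```

Comparing the converted read with the reference at each cytosine is what allows methylation to be inferred: a retained C indicates methylation, a T indicates an unmethylated cytosine.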

Updated May 2021

Pipeline Overview

The WGBS pipeline was developed as a part of the ENCODE Uniform Processing Pipelines series [1]. The full WGBS pipeline code is available on GitHub and can be run on various platforms via Caper. It uses gemBS [2] for alignment and methylation extraction.

Pipeline Schematic

View the current schematic 

Inputs:

File format | Information contained in file | File description | Notes
fastq | reads | Gzipped DNA-sequencing reads | Reads must meet the criteria outlined under the Uniform Processing Pipeline Restrictions.
tar | genome index | Collection of index files used by gemBS [2] |
tsv | chromosome sizes | Chromosome sizes for the mapping assembly |

Outputs:

File format | Information contained in file | File description | Notes
bam | alignments | Produced by mapping reads to the genome |
bigWig | CpG sites coverage | Read coverage at CpG sites |
bigWig | plus strand methylation state at CpG | Percent methylation at plus-strand CpG sites |
bigWig | minus strand methylation state at CpG | Percent methylation at minus-strand CpG sites |
bed9+ and bigBed9+ | methylation state at CpG | Percent methylation at CpG sites (see Description of bed9+ file below this table). | CpG is a sequence in which a cytosine is followed by a guanine.
bed9+ and bigBed9+ | methylation state at CHG | Percent methylation at CHG sites (see Description of bed9+ file below this table). | CHG is a sequence in which a cytosine and a guanine are separated by an adenine, a cytosine, or a thymine.
bed9+ and bigBed9+ | methylation state at CHH | Percent methylation at CHH sites (see Description of bed9+ file below this table). | CHH is a sequence in which a cytosine is followed by two nucleotides that may be any of adenine, cytosine, or thymine.

Quality control metrics are collected from the Samtools and gemBS software suites, and Pearson correlation is calculated from the two replicates' methylation states at CpG.

Description of bed9+ file

The bed9+ file contains the number of reads and the percent methylation at each site. The first 11 columns correspond to the ENCODE bedMethyl file format and the last 3 contain information about the genotype at the position.

Each column represents the following:

  1. Reference chromosome or scaffold
  2. Start position in chromosome
  3. End position in chromosome
  4. Name of item
  5. Score from 0-1000. Capped number of reads
  6. Strandedness, plus (+), minus (-), or unknown (?)
  7. Start of where display should be thick
  8. End of where display should be thick
  9. Color value (RGB)
  10. Coverage, or number of reads
  11. Percentage of reads that show methylation at this position in the genome
  12. Reference genotype
  13. Sample genotype
  14. Quality score for genotype call
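The column layout above can be read into a typed record as in the following Python sketch. The field names and the example line are illustrative assumptions (they are not taken from the pipeline), the file is assumed to be tab-delimited as is usual for BED, and the genotype columns are kept as plain strings because their exact encoding is not described here.

```python
from typing import NamedTuple


class MethylRecord(NamedTuple):
    """One line of a bed9+ file; field names are hypothetical."""
    chrom: str
    start: int
    end: int
    name: str
    score: int               # 0-1000, capped number of reads
    strand: str              # "+", "-", or "?"
    thick_start: int
    thick_end: int
    rgb: str                 # color value, e.g. "0,255,0"
    coverage: int            # number of reads
    percent_methylated: float
    ref_genotype: str
    sample_genotype: str
    genotype_quality: float


def parse_bed9plus_line(line):
    """Parse one tab-delimited bed9+ line into a MethylRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 14:
        raise ValueError(f"expected 14 columns, found {len(fields)}")
    return MethylRecord(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3],
        score=int(fields[4]),
        strand=fields[5],
        thick_start=int(fields[6]),
        thick_end=int(fields[7]),
        rgb=fields[8],
        coverage=int(fields[9]),
        percent_methylated=float(fields[10]),
        ref_genotype=fields[11],
        sample_genotype=fields[12],
        genotype_quality=float(fields[13]),
    )


if __name__ == "__main__":
    # Made-up example line, not real data.
    example = "chr1\t10468\t10469\t.\t12\t+\t10468\t10469\t0,255,0\t12\t83\tCG\tCG\t40\n"
    record = parse_bed9plus_line(example)
    print(record.chrom, record.start, record.coverage, record.percent_methylated)
```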

References

Genomic References

View the mapping assemblies (includes lambda genome for generation of comparative statistics)
View the gemBS reference files used in this pipeline

Links and Publications

Find data generated by the WGBS pipeline
Explore WGBS-related publications on the portal here

1. Tsuji, Junko, and Zhiping Weng. "Evaluation of preprocessing, mapping and postprocessing algorithms for analyzing whole genome bisulfite sequencing data." Briefings in Bioinformatics 17.6 (2015): 938-952.

2. Merkel, Angelika, et al. "gemBS: high throughput processing for DNA methylation data from bisulfite sequencing." Bioinformatics 35.5 (2019): 737-742.

Uniform Processing Pipeline Restrictions

  • Each replicate should have 30X coverage.
  • The read length should be a minimum of 100 base pairs.
  • Sequencing may be paired- or single-ended, as long as sequencing type is specified and paired sequences are indicated.
  • The sequencing platform must be indicated.
  • Barcodes, if present in fastq, must be indicated.
  • The pipeline also maps reads against the lambda genome as a control.
  • Reads are mapped to either the GRCh38 or mm10 assembly.
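As a rough illustration of the 30X requirement, sequencing depth can be estimated as total sequenced bases divided by genome size. The Python sketch below is only a back-of-the-envelope check under stated assumptions: the function names are hypothetical, the genome size is taken as the sum of a two-column chromosome sizes file (name, size), every read is assumed to have the same length, and unmapped or duplicate reads are not accounted for, so the coverage the pipeline actually reports may be lower.

```python
def genome_size_from_chrom_sizes(text):
    """Sum the sizes in a tab-delimited 'name<TAB>size' chromosome sizes table."""
    total = 0
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _name, size = line.split("\t")[:2]
        total += int(size)
    return total


def estimated_coverage(n_reads, read_length, genome_size):
    """Approximate depth: total sequenced bases divided by genome size.

    For paired-end data, n_reads counts each mate separately.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


if __name__ == "__main__":
    # Toy chromosome sizes, not a real assembly.
    sizes = "chrA\t1000\nchrB\t500\n"
    genome = genome_size_from_chrom_sizes(sizes)
    print(estimated_coverage(n_reads=450, read_length=100, genome_size=genome))
```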

Current Standards

Experimental guidelines for WGBS experiments can be found here.

  • Experiments should have two or more biological replicates; they may have two technical replicates per biological replicate. Assays performed using EN-TEx samples may be exempted due to limited availability of experimental material.

  • The C to T conversion rate should be ≥98%.
  • The CpG quantification should have a Pearson correlation of ≥0.8 for sites with ≥10X coverage.
  • Sequencing may be paired- or single-ended, as long as sequencing type is specified and paired sequences are indicated.
  • The experiment must pass routine metadata audits in order to be released.
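The two numeric thresholds above can be checked with a short Python sketch. It is not the pipeline's own QC code: the function names and the input layout (a dictionary from site to coverage and percent methylation) are hypothetical, and applying the ≥10X filter to both replicates at each shared site is an assumption about how the coverage condition is meant.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def cpg_replicate_correlation(rep1, rep2, min_coverage=10):
    """Correlate percent methylation at CpG sites shared by two replicates.

    rep1 and rep2 map (chrom, start) -> (coverage, percent_methylated).
    Only sites with at least min_coverage reads in BOTH replicates are used
    (an assumption of this sketch).
    """
    shared = [
        site for site in rep1
        if site in rep2
        and rep1[site][0] >= min_coverage
        and rep2[site][0] >= min_coverage
    ]
    xs = [rep1[site][1] for site in shared]
    ys = [rep2[site][1] for site in shared]
    return pearson(xs, ys)


def meets_numeric_standards(conversion_rate, correlation):
    """Conversion rate (as a fraction) >= 0.98 and Pearson correlation >= 0.8."""
    return conversion_rate >= 0.98 and correlation >= 0.8


if __name__ == "__main__":
    # Made-up values for illustration only.
    rep1 = {("chr1", 1): (12, 80.0), ("chr1", 5): (15, 20.0), ("chr1", 9): (30, 55.0)}
    rep2 = {("chr1", 1): (11, 75.0), ("chr1", 5): (20, 25.0), ("chr1", 9): (10, 50.0)}
    r = cpg_replicate_correlation(rep1, rep2)
    print(r, meets_numeric_standards(0.99, r))
```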