DNase-seq Data Standards and Processing Pipeline
DNaseI-seq, or DNase-seq as it is referred to on the ENCODE portal, is a global and high-resolution method that uses the non-specific endonuclease DNaseI to map chromatin accessibility. These accessible regions, designated as DNaseI hypersensitive sites (DHSs), define the regulatory features (e.g., promoters, enhancers, insulators, locus control regions) of complex genomes. DNase-seq is an unbiased and robust method that is not predicated on an a priori understanding of regulatory patterns or chromatin features.
Updated Dec 2020
The ENCODE 4 DNase-seq pipeline was developed in collaboration with the Stamatoyannopoulos lab as a part of the ENCODE Uniform Processing Pipelines series. The pipeline for processing DNase I sequencing data takes in Illumina sequencing reads from DNase I experiments, assesses the data, and returns genomic regions with statistically significant enrichments, or 'hotspots', of cleavage activity from the original DNase I experiment. The full DNase-seq pipeline code is available on GitHub.
View the current pipeline instance.
Information contained in file
|G-zipped reads, paired-ended or single-ended.||Multiple FASTQs from a single biological replicate or library are concatenated before mapping|
|fasta||genome indices, genome reference||Indices are dependent on the assembly being used for mapping|
|List of expected chromosomes|
|bed, starch||hotspots reference||Reference files used for hotspots and peaks calling||Hotspots reference must be selected guided by the input FASTQ read length|
|txt||bias model||Reference file used for footprints modeling|
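The input table notes that multiple FASTQs from one biological replicate or library are concatenated before mapping. The following is a minimal sketch of one way to do that, not the pipeline's own code. It relies on the fact that gzip streams appended byte-for-byte still form a valid gzip file. The function name `concatenate_fastqs` and the file names are hypothetical.

```python
import shutil
from pathlib import Path


def concatenate_fastqs(parts, destination):
    """Append gzipped FASTQ files byte-for-byte into one gzipped file.

    Hypothetical helper for illustration. A file made of several gzip
    members placed back to back is itself a valid gzip file, so no
    decompression is needed.
    """
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(destination)
```

For paired-end data, the read 1 files and the read 2 files would need to be concatenated in the same order so that mates stay aligned record by record.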
Information contained in file
|bam||unfiltered alignments||FASTQs are concatenated, trimmed for adapters and length, and aligned with BWA|
|bam||alignments||unfiltered alignments are filtered and deduplicated|
|bigWig||read-depth normalized signal||Density starch from Hotspot2 is normalized and converted to a bigWig|
|bed and bigBed narrowPeak||peaks||DNase I hypersensitivity sites identified using Hotspot2||Both 5% and 0.1% FDR peak calls are included in the outputs. FDR stands for False Discovery Rate.|
|bed and bigBed bed3+||FDR cut rate||FDR estimate per site for being called hypersensitive to cleavage by DNase I|
|bed and bigBed bed3+||||A footprint model is fit and statistical deviations from expected cleavage rates within hotspot regions are called and thresholded||1% FDR footprints. A detailed explanation of the footprint modeling is available in the footprint-tools documentation.|
|DNase I hypersensitivity sites identified using Hotspot2||0.1% FDR peaks|
View the reference files used in this pipeline
Links and Publications
Find data generated by the ENCODE 4 DNase-seq pipeline
Experimental guidelines for DNase experiments can be found here.
- Experiments should have two or more biological replicates, isogenic or anisogenic. Assays performed using EN-TEx samples or other rare types may be exempted due to limited availability of experimental material.
- A SPOT (Signal Portion of Tags) score of 0.4 or higher indicates high-quality data.
- A SPOT score of 0.25 is considered minimally acceptable for rare and hard-to-find primary tissues. In very rare cases of limited sample availability, lower-scoring data may be used with appropriate caution.
- Any sample with a SPOT score <0.3 should be targeted for replacement with a higher quality sample, whenever possible.
- SPOT scores should be calculated on de-duplicated data or on data for which the duplicate rates are <5%.
- DNase-seq requires a minimum of 20 million uniquely mapping reads to generate a reliable SPOT score, and 100 million uniquely mapping reads to generate reliable DNase footprints.
- For a conventional DNase-seq profile, 50 million uniquely mapping reads are recommended.
- For deep, footprinting-depth DNase-seq, a depth of 150-200 million uniquely mapping paired-end reads is recommended.
- Acceptable mappability rates and mitochondrial content are listed below:
|SPOT Score*||Mitochondrial Fraction||Mappability (without mitochondrial fraction)||Resulting Data Status|
|≥ 0.4||< 10%||> 75%||Good|
|0.3 - 0.4||10 - 15%||65 - 75%||Acceptable|
|< 0.25||> 15%||< 65%||Poor|
|* SPOT scores of 0.25 - 0.3 are minimally acceptable in certain circumstances (see above explanation)|
- Replicate concordance: the gene level quantification should have a Pearson correlation of >0.9 between isogenic replicates and >0.85 between anisogenic replicates.
- The experiment must pass routine metadata audits in order to be released.
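The thresholds in the list and table above can be sketched as a small classifier. This is an illustration under stated assumptions, not an ENCODE tool: the function names are hypothetical, fractions are given on a 0 to 1 scale, and combining the three metrics by taking the worst grade is an assumption, since the standards do not say how to combine them.

```python
# Grades ordered from worst to best so they can be ranked by position.
GRADES = ["Poor", "Minimally acceptable", "Acceptable", "Good"]


def grade_spot(spot):
    """Grade a SPOT score using the table and its footnote."""
    if spot >= 0.4:
        return "Good"
    if spot >= 0.3:
        return "Acceptable"
    if spot >= 0.25:
        return "Minimally acceptable"  # only in certain circumstances
    return "Poor"


def grade_mito(fraction):
    """Grade the mitochondrial read fraction (0.10 means 10%)."""
    if fraction < 0.10:
        return "Good"
    if fraction <= 0.15:
        return "Acceptable"
    return "Poor"


def grade_mappability(fraction):
    """Grade mappability, computed without the mitochondrial fraction."""
    if fraction > 0.75:
        return "Good"
    if fraction >= 0.65:
        return "Acceptable"
    return "Poor"


def overall_status(spot, mito, mappability):
    """Assumption: the overall status is the worst of the three grades."""
    grades = [grade_spot(spot), grade_mito(mito), grade_mappability(mappability)]
    return min(grades, key=GRADES.index)


def depth_checks(unique_reads):
    """Compare uniquely mapping read counts with the stated depth guidance."""
    return {
        "reliable_spot_score": unique_reads >= 20_000_000,
        "recommended_standard_depth": unique_reads >= 50_000_000,
        "reliable_footprints": unique_reads >= 100_000_000,
    }
```

For example, `overall_status(0.45, 0.05, 0.80)` returns `"Good"`, while a SPOT score of 0.27 with otherwise good metrics returns `"Minimally acceptable"`.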
Uniform Processing Pipeline Restrictions
- The read length should be a minimum of 36 base pairs.
- Read trimming of adapter sequences is recommended.
- Failure to trim adapter sequence from fragments shorter than the number of sequencing cycles can cause signal from small fragments to be filtered out and can bias the resulting alignments.
- There is no post-trimming read length minimum, but effective fragment mapping drops significantly at kmer-lengths of less than 22 base pairs.
- Adapter sequences used in library creation should be documented and available to the pipeline.
- Barcodes/UMI coding should be indicated in the metadata and available for accurate application of duplication filtering methods by the pipeline.
- Sequencing may be paired- or single-end, as long as sequencing type is specified and read pairs are indicated. Paired-end sequencing is preferred.
- The sequencing platform used must be indicated.
- Reads are mapped to either the GRCh38 or mm10 assembly.
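The 36 base pair minimum above applies to reads as sequenced, since there is no post-trimming minimum. A minimal sketch of how a submitter might check a FASTQ before upload follows. The function names are hypothetical, and the parser assumes standard four-line FASTQ records with unwrapped sequence lines.

```python
import gzip

MIN_READ_LENGTH = 36  # minimum read length from the restrictions above


def read_lengths(fastq_path):
    """Yield the sequence length of each record in a FASTQ file.

    Assumes four lines per record; the sequence is the second line.
    Files ending in .gz are opened with gzip.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield len(line.rstrip("\r\n"))


def meets_minimum_length(fastq_path, minimum=MIN_READ_LENGTH):
    """Return True if every read is at least `minimum` bases long."""
    return all(length >= minimum for length in read_lengths(fastq_path))
```

This check is meant for untrimmed input; running it on adapter-trimmed reads would flag short fragments that the restrictions explicitly allow.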