Audits

What are audits?

The ENCODE Data Coordination Center has implemented a system of audits, or flags, to provide additional information to the research community about the quality of the data on the portal. These flags may indicate an error in the experimental metadata, or may indicate that the data itself does not meet some aspect of the Consortium's standards. The flag's color corresponds to the severity of the problem.

Read Coverage Audits

Audit message Explanation
Control insufficient read depth
Control low read depth

A ChIP-seq control should be sequenced to a depth similar to that of the experiment. For this experiment, the control depth is insufficient.

Extremely low read depth
Insufficient read depth
Low read depth
The read depth of the alignment files is below the given threshold for the assay type. For a list of standards for individual assay types, see the following links:
WGBS
DNase-seq
Bulk RNA-seq, shRNA, CRISPR knockdown
Small RNA-seq
RAMPAGE
ChIP-seq
ATAC-seq
microRNA-seq
microRNA counts
RNA Bind-n-Seq
eCLIP
Extremely low coverage
Insufficient coverage
Low coverage
Each biological replicate processed by the WGBS paired-end pipeline is recommended to have 30X coverage. Replicates with coverage between 25X - 30X receive a "Low coverage" audit; replicates with coverage between 5X - 25X receive an "Insufficient coverage" audit; and replicates with coverage less than 5X receive an "Extremely low coverage" audit.
Insufficient number of aligned reads
Borderline number of aligned reads
The alignment file for a microRNA-seq experiment did not have a sufficient number of aligned reads. For more information on data standards, please visit the microRNA-seq page.
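
As an illustration of the WGBS coverage bands described above, the following minimal Python sketch maps a replicate's coverage to the corresponding audit message. The function name is hypothetical, and the handling of values that fall exactly on 5X, 25X, or 30X is an assumption, since the text does not say which band includes its endpoints.

```python
from typing import Optional


def wgbs_coverage_audit(coverage: float) -> Optional[str]:
    """Return the coverage audit message for one WGBS replicate, or None.

    Assumption: each band includes its lower bound and excludes its upper
    bound (for example, exactly 25X counts as "Low coverage").
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    if coverage < 5:
        return "Extremely low coverage"
    if coverage < 25:
        return "Insufficient coverage"
    if coverage < 30:
        return "Low coverage"
    return None  # meets the recommended 30X coverage


if __name__ == "__main__":
    for cov in (3.2, 12.0, 27.5, 31.0):
        print(cov, wgbs_coverage_audit(cov))
```

A return value of None means the replicate meets the recommended coverage and receives no coverage audit.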

Replication Audits

Audit message Explanation
Insufficient replicate concordance
Low replicate concordance
Borderline replicate concordance

The value for the metric used to measure replicate concordance is below the given threshold for the assay type, so the experiment has been flagged for poor reproducibility. For a list of standards for individual assay types, see the following links:

WGBS
DNase-seq
Bulk RNA-seq, shRNA, CRISPR knockdown
Small RNA-seq
RAMPAGE
ChIP-seq
ATAC-seq
microRNA-seq
microRNA counts
RNA Bind-n-Seq
eCLIP

Unreplicated experiment
ENCODE experiments (excluding ENTEx/GTEx) are required to have at least two biological replicates. Experiments using samples from the GTEx consortium do not require more than one replicate because of the limited availability of tissues.
Inconsistent replicate
A file came from a replicate that belongs to a different experiment from the one in which the file is found, or the replicate numbers do not match between parent and derived files.
Technical replicates with not identical biosample
Two technical replicates do not share the same biosample.
Replicate with no library
The library created from the experimental replicate was not uploaded and/or attached to the replicate.
Insufficient number of reproducible peaks
Moderate number of reproducible peaks
The ATAC-seq dataset lacks sufficient numbers of reproducible peaks in the overlap peaks or IDR thresholded peaks files.
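
The replication requirement behind the "Unreplicated experiment" audit can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are hypothetical, and the real check runs on the portal's own metadata objects.

```python
from typing import Iterable


def needs_unreplicated_audit(biological_replicate_numbers: Iterable[int],
                             uses_gtex_samples: bool = False) -> bool:
    """Return True if an experiment would be flagged as unreplicated.

    Assumption: replicates are identified by their biological replicate
    number, so repeated numbers are technical replicates of one sample.
    """
    if uses_gtex_samples:
        # ENTEx/GTEx experiments are exempt because tissue is limited.
        return False
    return len(set(biological_replicate_numbers)) < 2


if __name__ == "__main__":
    print(needs_unreplicated_audit([1, 1]))        # two technical replicates
    print(needs_unreplicated_audit([1, 2]))        # two biological replicates
    print(needs_unreplicated_audit([1], True))     # GTEx sample, exempt
```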

Library Complexity Audits

Audit message Explanation
Severe bottlenecking
Mild to moderate bottlenecking
Bottlenecking for ChIP-seq assays is measured using PCR Bottlenecking Coefficients 1 and 2 (PBC1 and PBC2). For more information on the expected values, see the ChIP-seq standards page.
Insufficient library complexity
Moderate library complexity
Poor library complexity
Library complexity for ChIP-seq experiments is measured using the Non-Redundant Fraction (NRF). For more information on the expected values for NRF, see the ChIP-seq standards page.
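
To make the three complexity metrics named above concrete, here is a small Python sketch that computes them from the mapped positions of uniquely mapping reads. It uses the commonly cited definitions (NRF = distinct positions / total reads; PBC1 = positions with exactly one read / distinct positions; PBC2 = positions with exactly one read / positions with exactly two reads). This page does not spell those formulas out, so treat them as an assumption and confirm them, along with the thresholds, on the ChIP-seq standards page. The function name and input format are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, NamedTuple


class Complexity(NamedTuple):
    nrf: float
    pbc1: float
    pbc2: float


def library_complexity(read_positions: Iterable[Hashable]) -> Complexity:
    """Compute NRF, PBC1 and PBC2 from one position key per mapped read.

    Assumption: the caller supplies only uniquely mapping reads, each
    reduced to a hashable key such as (chromosome, start, strand).
    """
    reads_per_position = Counter(read_positions)
    total_reads = sum(reads_per_position.values())
    if total_reads == 0:
        raise ValueError("no reads supplied")
    distinct = len(reads_per_position)
    m1 = sum(1 for n in reads_per_position.values() if n == 1)
    m2 = sum(1 for n in reads_per_position.values() if n == 2)
    nrf = distinct / total_reads
    pbc1 = m1 / distinct
    # PBC2 is undefined when no position has exactly two reads.
    pbc2 = m1 / m2 if m2 else float("inf")
    return Complexity(nrf, pbc1, pbc2)


if __name__ == "__main__":
    reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 200, "+"),
             ("chr2", 50, "-"), ("chr2", 75, "-"), ("chr2", 75, "-")]
    print(library_complexity(reads))
```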

Enrichment Audits

Audit message Explanation

Extremely low SPOT score
Insufficient SPOT score
Low SPOT score

The SPOT (Signal Portion of Tags) score is a measure of enrichment used in DNase-seq experiments. For more information on the expected values, please see the DNase-seq standards page.

Extremely low TSS enrichment
Moderate TSS enrichment
Low TSS enrichment

Transcription Start Site (TSS) enrichment values for alignments in an ATAC-seq assay are below standards. The ideal value is > 7, with 5-7 being acceptable and < 5 being noncompliant. For more information on the expected values, please see the ATAC-seq standards page.

Low FRiP score

FRiP (fraction of reads in called peak regions) for overlap peak files in an ATAC-seq assay is below standards. The ideal value is > 0.3, with 0.2-0.3 being acceptable and < 0.2 being noncompliant. For more information on the expected values, please see the ATAC-seq standards page.
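
The ATAC-seq bands quoted above for TSS enrichment and FRiP can be expressed as a short Python sketch. The function names are hypothetical, and treating the endpoints 5, 7, 0.2, and 0.3 as "acceptable" is an assumption based on the wording "> 7" and "< 5". The sketch returns the band only; it does not guess which audit message the portal attaches to each band.

```python
def classify_band(value: float, acceptable_min: float, ideal_above: float) -> str:
    """Place a metric into the ideal / acceptable / noncompliant bands."""
    if value > ideal_above:
        return "ideal"
    if value >= acceptable_min:
        return "acceptable"
    return "noncompliant"


def classify_tss_enrichment(score: float) -> str:
    # Ideal is > 7, 5-7 is acceptable, < 5 is noncompliant.
    return classify_band(score, acceptable_min=5.0, ideal_above=7.0)


def classify_frip(frip: float) -> str:
    # Ideal is > 0.3, 0.2-0.3 is acceptable, < 0.2 is noncompliant.
    return classify_band(frip, acceptable_min=0.2, ideal_above=0.3)


if __name__ == "__main__":
    print(classify_tss_enrichment(8.1), classify_tss_enrichment(4.9))
    print(classify_frip(0.35), classify_frip(0.25))
```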

Uniform Pipeline Requirements

Audit message Explanation
Missing spikeins
Bulk RNA-seq, shRNA knockdown, and CRISPR editing followed by RNA-seq assays require spike-ins, but they are missing in the given experiment.
Missing RNA fragment size
The library created for an RNA-seq experiment lacks information on the size of the library fragments.
Missing input control
ChIP-seq experiments must have at least one input control, but the control given is not an input control. For example, it may be a mock immunoprecipitation instead.
Missing run_type
A fastq file does not contain information on the run type used to produce it (single-end vs. paired-end).
Inconsistent control read length
Inconsistent control run type
Inconsistent control platform

The uniform pipelines expect that the controls and the files they control share identical run_types and read lengths. Otherwise, trimming may be required. Similarly, the pipelines expect the same or comparable sequencing platforms.

Inconsistent platforms

The uniform pipelines expect files within an experiment to have been produced from the same or comparable sequencing platforms.

Mixed read lengths
Mixed run types

A single experiment contains fastq files with different read lengths and/or run types, either within or among replicates.
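
A check of this kind can be sketched in Python as follows. The sketch assumes each fastq is represented as a dictionary with `read_length` and `run_type` keys; `run_type` is the property named on this page, while `read_length` and the function name are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List


def mixed_fastq_audits(fastqs: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the 'Mixed ...' audit messages that apply to one experiment."""
    fastqs = list(fastqs)
    # Collect the distinct values seen across all replicates.
    read_lengths = {f["read_length"] for f in fastqs}
    run_types = {f["run_type"] for f in fastqs}
    audits = []
    if len(read_lengths) > 1:
        audits.append("Mixed read lengths")
    if len(run_types) > 1:
        audits.append("Mixed run types")
    return audits


if __name__ == "__main__":
    experiment_fastqs = [
        {"read_length": 36, "run_type": "single-ended"},
        {"read_length": 100, "run_type": "single-ended"},
    ]
    print(mixed_fastq_audits(experiment_fastqs))
```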

Extremely low read length
Insufficient read length
Low read length
The sequencing read length is below the given threshold for the assay type. For a list of standards for individual assay types, see the following links:
WGBS
DNase-seq
Bulk RNA-seq, shRNA, CRISPR knockdown
Small RNA-seq
RAMPAGE
ChIP-seq
ATAC-seq
microRNA-seq
microRNA counts
RNA Bind-n-Seq
eCLIP
Non-standard run_type

The pipelines for specific assay types require either single-end or paired-end sequencing, but the required sequencing type was not performed. For a list of standards for individual assay types, see the following links:
WGBS
DNase-seq
Bulk RNA-seq, shRNA, CRISPR knockdown
Small RNA-seq
RAMPAGE
ChIP-seq
ATAC-seq
microRNA-seq
microRNA counts
RNA Bind-n-Seq
eCLIP

Not compliant platform
The sequencing platform used is not compatible with the processing pipeline. For a list of standards for individual assay types, see the following links:
WGBS
DNase-seq
Bulk RNA-seq, shRNA, CRISPR knockdown
Small RNA-seq
RAMPAGE
ChIP-seq
ATAC-seq
microRNA-seq
microRNA counts
RNA Bind-n-Seq
eCLIP

Antibody Audits

Audit message Explanation
Duplicate lane review
The flagged antibody has already been reviewed for the biosample type in question. For example, there may be multiple lanes in a single western blot that characterize the antibody in K562. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays.
Not tagged antibody
Inconsistent target
The antibody and experimental targets of interest do not match. This may be because the experimental target is tagged and the antibody does not apply to that tag, or because the target proteins are completely different between the antibody and the experiment. 
Mismatched tag target

The antibody target and experiment target are not the same. This may be a metadata error. The lab and the DCC should clarify whether the two targets are meant to be the same, or whether the experiment is using a different antibody with the proper target.

Not eligible antibody
The antibody used in the experiment is not eligible for use because it has not been fully characterized in the biosample type (e.g. liver tissue or K562) used by the experiment.
Partially characterized antibody
The antibody used in the experiment has either its primary or secondary characterization (but not both) for the given biosample (e.g. liver tissue or K562) used by the experiment.
Uncharacterized antibody
The antibody used in the experiment is lacking a primary and secondary characterization for the given biosample (e.g. liver tissue or K562) used by the experiment.
Antibody not characterized to standard
The antibody used in the experiment has non-compliant characterizations, and no compliant characterizations, for the biosample type (e.g. liver tissue or K562) used by the experiment.
Antibody characterized with exemption

The antibody used in the assay did not pass its primary characterization test, but the secondary characterization was able to rescue the primary and it passed with exemption. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays.

Characterizations not reviewed

The antibody has old characterizations, perhaps from previous iterations of ENCODE, that were not reviewed or submitted for review. 

No characterizations submitted

The antibody lacks any attempt at characterization. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays.

No primary characterizations

The antibody does not have any attempt at primary characterization in accordance with the ENCODE antibody characterization standards. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays.

No secondary characterizations

The antibody does not have any attempt at secondary characterization in accordance with the ENCODE antibody characterization standards. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays.

Need compliant primaries

Any and all attempts at primary characterization of this antibody do not meet the standards.

Need compliant secondaries

Any and all attempts at secondary characterization of this antibody do not meet the standards.

Metadata Audits

Audit message Explanation

Missing antibody
Missing biosample
Missing biosample_term_id
Missing biosample_term_name
Missing biosample type
Missing donor
Missing target
Missing documents

If an assay type requires any of the following, but the required property was not provided, the experiment is given a flag: antibody, ontology name or ID for the assay type, biosample used to make the library, the ontology term or ID for the biosample, the type of biosample (e.g. immortalized cell line or tissue), the biosample’s donor, the target or molecule of interest, the transfection type (i.e. stable or transient), or protocol documents.
Multiple paired_with
The raw files are the product of paired-end sequencing, and the file in question has been marked as paired with more than one other file, which is not allowed.
Missing raw data in replicate
Each experimental replicate's library must have a corresponding raw sequencing file, such as a fastq.
Missing derived_from
A processed file should have information on the files from which it was derived. For example, an alignment file should indicate which raw data files and reference indices were used to create it.
Missing control alignments
A peaks file must specify the control alignments used to generate it.

Missing possible controls
Mismatched control
Missing controlled_by

ChIP-seq, RAMPAGE, and CAGE experiments all require controls. A flag appears if the control is missing or has a different biosample type from the experiment it controls (e.g. K562 versus MCF-7). A flag will also appear if the control files are not matched to corresponding experimental files; this information is stored in a property called "controlled_by" in the experimental file object. Please note that the "missing possible controls" audit has different flag colors depending on the assay type and the project phase.

Missing genotype
Missing external identifiers

Biosample donors for worms and flies (Caenorhabditis and Drosophila) must have their genotypes listed in accordance with the nomenclature rules (for fly, for worm), and must have external references (e.g. GEO/SAMN IDs) listed. 

Inconsistent ontology term
The ontology term does not match the ontology ID provided.
Inconsistent depleted_in_term length
Depleted_in length mismatch
Some tissue type was removed from the biosample before library creation. The list of ontology term names of the tissues removed does not match the list of the ontology term IDs in either the biosample or the library. 
Inconsistent organism
Inconsistent donor
Inconsistent library biosample
Inconsistent age
Inconsistent sex
The biosamples used for each replicate within an experiment do not share the given properties.
Inconsistent paired_with
Two read pair files from paired-end sequencing are annotated as belonging to different experiment replicates.
Missing paired_with
A paired-end fastq in this dataset is missing metadata on its paired file.
Inconsistent read count
A fastq in this dataset is paired with a fastq with a different read count.
Inconsistent target of control experiment

A control experiment does not have its target annotated as “control” in the metadata. Rather, the target is some transcription factor or chromatin modifier.

Inconsistent control
The experiment and its control were not performed on the same type of biosample, e.g. same cell line or tissue type; the control file is of a different format than the experimental file being controlled, e.g. fastq vs. idat; or the control file being used is from a control experiment that is not listed in the possible_controls property of the experiment.
Inconsistent document_type
A document has been attached to a file, but does not describe the file format specifications for that file.
Inconsistent mutated_gene organism
The organism from which the biosample came does not match the organism of the mutated gene in the donor. Donor mutated_gene should be of the same species as the donor and biosample.
Invalid donor mutated_gene
Donor mutated genes should not be tags, controls, recombinant proteins, or modifications.
Invalid dates
The date that the cell culture was harvested precedes the date on which the culture was started.
Invalid possible_control
The experiment being used as a control is not designated as a control in the metadata.
Invalid depleted_in_term_id
Before sequencing the library, a specific type of nucleic acid (e.g. polyA RNA) was removed. The nucleic acid that was sequenced is listed as the same nucleic acid that was removed.
Unexpected step_run
An incorrect pipeline step was attached to the file in question, e.g. a peak calling step that outputs peak files was attached to an alignment file instead.
Missing analysis step_run
A processed file in the dataset has no metadata about the pipeline run that generated it.
Matching md5 sums
A processed file in this dataset is identical to another file.
Lacking processed data
The dataset has no downstream processed data.
Missing genetic modification characterization
A genetic modification used in this experiment is lacking a characterization.
Missing biosample characterization
A genetically modified biosample used in this experiment is lacking a characterization.
File validation error
A file in the dataset failed to pass automated validation checks during the submission process.
Missing fragmentation method
Hi-C libraries must specify the fragmentation method used to generate them.
Missing genetic modification reagents
Genetic modifications should specify any reagents used to perform the modification.
Missing queried RNP size range
eCLIP libraries should specify the queried RNP size range.
Inconsistent assembly
A file in this dataset is aligned to an assembly different from the assembly of the file it derives from.
Improper control type of control experiment
The experiment lacks a control experiment with control_type of "input library" or "wild type".
Missing control type of control experiment
The control experiment has no control_type specified.
Unexpected target of control experiment
The experiment is a control experiment, but has a specified target.
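
Two of the simpler metadata checks in this table, "Invalid dates" and "Inconsistent read count", can be sketched in Python as shown here. The function and parameter names are hypothetical and are not taken from the portal's schema; the sketch only mirrors the rules as stated in the explanations above.

```python
from datetime import date
from typing import Optional


def invalid_dates_audit(culture_start: date, culture_harvest: date) -> Optional[str]:
    """Flag a culture whose harvest date precedes its start date."""
    if culture_harvest < culture_start:
        return "Invalid dates"
    return None


def read_count_audit(read1_count: int, read2_count: int) -> Optional[str]:
    """Flag a pair of paired-end fastqs whose read counts differ."""
    if read1_count != read2_count:
        return "Inconsistent read count"
    return None


if __name__ == "__main__":
    print(invalid_dates_audit(date(2016, 5, 10), date(2016, 5, 3)))
    print(read_count_audit(25_000_000, 24_999_870))
```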

Dataset Consistency Audits

Audit message Explanation
Missing reference

Publication file sets should be linked to a specific publication.

Missing IHEC required assay

Reference Epigenome datasets must have at least one of each of the IHEC required assays.

Multiple donors in reference epigenome

A reference epigenome dataset has experiments conducted in biosamples from more than one donor.

Multiple biosample treatments in reference epigenome

A reference epigenome dataset should not have multiple kinds of treatments between experiments, even if the type of biosample used is the same.
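
The two reference epigenome consistency rules above can be illustrated with a short Python sketch. It assumes each experiment is summarized as a dictionary with a `donor` key and an optional `treatments` list, and it treats any difference in the set of treatments between experiments as "multiple kinds of treatments". Both the data layout and that interpretation are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List


def reference_epigenome_audits(experiments: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the consistency audits that apply to one reference epigenome."""
    experiments = list(experiments)
    donors = {exp["donor"] for exp in experiments}
    # One frozenset of treatment names per experiment; an untreated
    # experiment contributes the empty set.
    treatment_sets = {frozenset(exp.get("treatments", ())) for exp in experiments}
    audits = []
    if len(donors) > 1:
        audits.append("Multiple donors in reference epigenome")
    if len(treatment_sets) > 1:
        audits.append("Multiple biosample treatments in reference epigenome")
    return audits


if __name__ == "__main__":
    print(reference_epigenome_audits([
        {"donor": "donor-A"},
        {"donor": "donor-B"},
    ]))
```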