Audits

What are audits?

The ENCODE Data Coordination Center has implemented a system of audits, or flags, to provide additional information to the research community about the quality of the data on the portal. These flags may indicate an error in the experimental metadata, or may indicate that the data itself does not meet some aspect of the consortium's standards. The flag's color corresponds to the severity of the problem.

Read Coverage Audits

Audit message

Explanation

Extremely low read length
Insufficient read length
Low read length

The sequencing read length is below the given threshold for the assay type. For a list of standards for individual assay types, see the following links:
WGBS
DNase-seq
Long RNA-seq, shRNA, CRISPR knockdown
Small RNA-seq
RAMPAGE
ChIP-seq
ATAC-seq
microRNA-seq
microRNA counts
RNA Bind-n-Seq
eCLIP

Control insufficient read depth
Control low read depth

A ChIP-seq control should be sequenced to a similar depth as the experiment. For this experiment, the control depth is insufficient.

Extremely low read depth
Insufficient read depth
Low read depth

The read depth of the alignment files is below the given threshold for the assay type. For a list of standards for individual assay types, see the following links:
WGBS
DNase-seq
Long RNA-seq, shRNA, CRISPR knockdown
Small RNA-seq
RAMPAGE
ChIP-seq
ATAC-seq
microRNA-seq
microRNA counts
RNA Bind-n-Seq
eCLIP

Insufficient coverage

Each biological replicate processed by the WGBS paired-end pipeline must have 30X coverage.
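As a rough illustration of the 30X requirement, fold coverage can be approximated as the total number of sequenced bases divided by the genome size. This is only a sketch and not the pipeline's own calculation: the function names, the approximate human genome size, and the read counts are assumptions chosen for the example.

```python
def estimated_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate fold coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def has_sufficient_coverage(coverage: float, required: float = 30.0) -> bool:
    """True when the coverage meets the 30X requirement described above."""
    return coverage >= required


# Assumed values, for illustration only.
HUMAN_GENOME_SIZE = 3_100_000_000  # approximate size in bases (assumption)
coverage = estimated_coverage(
    num_reads=700_000_000,  # individual reads; each mate of a pair counts once
    read_length=150,
    genome_size=HUMAN_GENOME_SIZE,
)
print(f"{coverage:.1f}X", has_sufficient_coverage(coverage))
```

The estimate ignores unmapped and duplicate reads, so coverage measured on the alignment files would typically be lower.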

Replication Audits

Audit message

Explanation

Insufficient replicate concordance
Low replicate concordance
Borderline replicate concordance

The value for the metric used to measure replicate concordance is below the given threshold for the assay type, so the experiment has been flagged for poor reproducibility. For a list of standards for individual assay types, see the following links:

WGBS
DNase-seq
Long RNA-seq, shRNA, CRISPR knockdown
Small RNA-seq
RAMPAGE
ChIP-seq
ATAC-seq
microRNA-seq
microRNA counts
RNA Bind-n-Seq
eCLIP

Unreplicated experiment

ENCODE experiments (excluding ENTEx/GTEx) are required to have at least two biological replicates. Experiments using samples from the GTEx consortium do not require more than one replicate because of the limited availability of tissues.

Inconsistent replicate

A file came from a replicate that belongs to a different experiment from the one in which the file is found, or the replicate numbers do not match between parent and derived files.

Technical replicates with not identical biosample

Two technical replicates do not share the same biosample.

Replicate with no library

The library created from the experimental replicate was not uploaded and/or attached to the replicate.
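The replicate rules above can be sketched as simple checks over a list of replicate records. This is an illustrative sketch, not the portal's audit code: the record layout and the key names `bio_rep`, `biosample`, and `library` are hypothetical.

```python
from collections import defaultdict


def replication_flags(replicates, from_gtex=False):
    """Return audit-style messages for a list of replicate records.

    Each record is a dict with the hypothetical keys 'bio_rep'
    (biological replicate number), 'biosample', and 'library'.
    """
    flags = []

    # Unreplicated: fewer than two distinct biological replicates,
    # unless the samples come from GTEx.
    bio_reps = {r["bio_rep"] for r in replicates}
    if len(bio_reps) < 2 and not from_gtex:
        flags.append("Unreplicated experiment")

    # Technical replicates share a biological replicate number,
    # so they should also share a single biosample.
    biosamples = defaultdict(set)
    for r in replicates:
        biosamples[r["bio_rep"]].add(r["biosample"])
    if any(len(samples) > 1 for samples in biosamples.values()):
        flags.append("Technical replicates with not identical biosample")

    # Every replicate needs an attached library.
    if any(r.get("library") is None for r in replicates):
        flags.append("Replicate with no library")

    return flags


reps = [
    {"bio_rep": 1, "biosample": "sample-A", "library": "lib-1"},
    {"bio_rep": 1, "biosample": "sample-B", "library": None},
]
print(replication_flags(reps))
```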

Library Complexity Audits

Audit message

Explanation

Severe bottlenecking
Mild to moderate bottlenecking

Bottlenecking for ChIP-seq assays is measured using PCR Bottlenecking Coefficients 1 and 2 (PBC1 and PBC2). For more information on the expected values, see the ChIP-seq standards page.

Insufficient library complexity
Poor library complexity

Library complexity for ChIP-seq experiments is measured using the Non-Redundant Fraction (NRF). For more information on the expected values for NRF, see the ChIP-seq standards page.
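To make the three metrics concrete, the sketch below computes them from a list of uniquely mapped read positions, using the commonly cited definitions: NRF is the number of distinct locations divided by the total number of reads, PBC1 is the number of locations with exactly one read divided by the number of distinct locations, and PBC2 is the number of locations with exactly one read divided by the number with exactly two. The function name and the input layout are assumptions for illustration; the thresholds that trigger each flag are on the standards page and are not reproduced here.

```python
from collections import Counter


def library_complexity(positions):
    """Compute (NRF, PBC1, PBC2) from uniquely mapped read positions.

    `positions` holds one hashable entry per read, for example
    (chromosome, start, strand).
    """
    total_reads = len(positions)
    if total_reads == 0:
        raise ValueError("no reads supplied")

    reads_per_location = Counter(positions)
    m_distinct = len(reads_per_location)                          # distinct locations
    m1 = sum(1 for n in reads_per_location.values() if n == 1)    # exactly one read
    m2 = sum(1 for n in reads_per_location.values() if n == 2)    # exactly two reads

    nrf = m_distinct / total_reads
    pbc1 = m1 / m_distinct
    pbc2 = m1 / m2 if m2 else float("inf")  # undefined when no location has two reads
    return nrf, pbc1, pbc2


reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"),  # one duplicated location
    ("chr1", 200, "+"), ("chr2", 50, "-"), ("chr2", 75, "-"),
]
print(library_complexity(reads))
```

Values close to 1 for NRF and PBC1 indicate that most reads come from distinct locations, while lower values suggest PCR duplication.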

Enrichment Audits

Audit Message

Explanation

Extremely low SPOT score
Insufficient SPOT score
Low SPOT score

The SPOT (Signal Portion of Tags) score is a measure of enrichment used in DNase-seq experiments. For more information on the expected values, please see the DNase-seq standards page.

Uniform Pipeline Requirements

Audit message

Explanation

Missing spikeins

Long RNA-seq, shRNA knockdown, and CRISPR editing followed by RNA-seq assays require spike-ins, but they are missing in the given experiment.

Missing RNA fragment size

The library created for an RNA-seq experiment lacks information on the size of the library fragments.

Missing input control

ChIP-seq experiments must have at least one input control, but the control given is not an input control. For example, it may be a mock immunoprecipitation instead.

Missing run_type

The fastq file does not contain information on the run type used to produce it (single-end vs. paired-end).

Inconsistent control read length
Inconsistent control run type
Inconsistent control platform

The uniform pipelines expect that the controls and the files they control share identical run_types and read lengths. Otherwise, trimming may be required. Similarly, the pipelines expect the same or comparable sequencing platforms. 

Inconsistent platforms

The uniform pipelines expect files within an experiment to have been produced from the same or comparable sequencing platforms. 

Mixed read lengths
Mixed run types

A single experiment contains fastq files with different read lengths and/or run types, either within or among replicates. 
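A check of this kind amounts to collecting the distinct read lengths and run types across an experiment's fastq files. The sketch below assumes each fastq is described by a dict with `read_length` and `run_type` keys; the function name and the example values are illustrative, not the portal's implementation.

```python
def mixed_fastq_flags(fastqs):
    """Flag an experiment whose fastq files differ in read length or run type.

    Each fastq is a dict with the assumed keys 'read_length' and 'run_type'.
    """
    flags = []
    read_lengths = {f["read_length"] for f in fastqs}
    run_types = {f["run_type"] for f in fastqs}
    if len(read_lengths) > 1:
        flags.append("Mixed read lengths")
    if len(run_types) > 1:
        flags.append("Mixed run types")
    return flags


fastqs = [
    {"replicate": 1, "read_length": 36, "run_type": "single-ended"},
    {"replicate": 2, "read_length": 101, "run_type": "paired-ended"},
]
print(mixed_fastq_flags(fastqs))
```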

Non-standard run_type

The pipelines for specific assay types require either single-end or paired-end sequencing, but the required sequencing type was not performed. For a list of standards for individual assay types, see the following links:
WGBS
DNase-seq
Long RNA-seq, shRNA, CRISPR knockdown
Small RNA-seq
RAMPAGE
ChIP-seq
ATAC-seq
microRNA-seq
microRNA counts
RNA Bind-n-Seq
eCLIP

Not compliant platform

The sequencing platform used is not compatible with the processing pipeline. For a list of standards for individual assay types, see the following links:
WGBS
DNase-seq
Long RNA-seq, shRNA, CRISPR knockdown
Small RNA-seq
RAMPAGE
ChIP-seq
ATAC-seq
microRNA-seq
microRNA counts
RNA Bind-n-Seq
eCLIP

Antibody Audits

Audit message

Explanation

Duplicate lane review

The flagged antibody has already been reviewed for the biosample type in question. For example, there may be multiple lanes in a single western blot that characterize the antibody in K562. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays.

Not tagged antibody
Inconsistent target

The antibody and experimental targets of interest do not match. This may be because the experimental target is tagged and the antibody does not apply to that tag, or because the target proteins are completely different between the antibody and the experiment.

Not eligible antibody

The antibody used in the experiment is not eligible for use because it has not been fully characterized in the biosample type (e.g. liver tissue or K562) used by the experiment.

Partially characterized antibody

The antibody used in the experiment has either its primary or secondary characterization (but not both) for the given biosample (e.g. liver tissue or K562) used by the experiment.

Uncharacterized antibody

The antibody used in the experiment is lacking a primary and secondary characterization for the given biosample (e.g. liver tissue or K562) used by the experiment.

Antibody not characterized to standard

The antibody used in the experiment has non-compliant characterizations, and no compliant characterizations, for the biosample type (e.g. liver tissue or K562) used by the experiment.

Antibody characterized with exemption

The antibody used in the assay did not pass its primary characterization test, but the secondary characterization was able to rescue the primary and it passed with exemption. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays.

Characterizations not reviewed

The antibody has old characterizations, perhaps from previous iterations of ENCODE, that were not reviewed or submitted for review. 

No characterizations submitted

The antibody lacks any attempt at characterization. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays.

No primary characterizations

The antibody does not have any attempt at primary characterization in accordance with the ENCODE antibody characterization standards. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays.

No secondary characterizations

The antibody does not have any attempt at secondary characterization in accordance with the ENCODE antibody characterization standards. For more on antibody standards, please see the antibody characterization guidelines for transcription factors, chromatin remodelers, RNA binding proteins, and antibodies used in fRIP assays.

Need compliant primaries

Any and all attempts at primary characterization of this antibody do not meet the standards.

Need compliant secondaries

Any and all attempts at secondary characterization of this antibody do not meet the standards.

Metadata Audits

Audit message

Explanation

Missing antibody
Missing biosample
Missing biosample_term_id
Missing biosample_term_name
Missing biosample type
Missing donor
Missing target
Missing documents

If an assay type requires any of the following, but the required property was not provided, the experiment is given a flag: antibody, ontology name or ID for the assay type, biosample used to make the library, the ontology term or ID for the biosample, the type of biosample (e.g. immortalized cell line or tissue), the biosample’s donor, the target or molecule of interest, the transfection type (i.e. stable or transient), or protocol documents.

Multiple paired_with

The raw files are the product of paired-end sequencing, and the file in question has been marked as paired with more than one other file, which is not allowed.

Missing raw data in replicate

Each experimental replicate's library must have a corresponding raw sequencing file, such as a fastq.

Missing derived_from

A processed file should have information on the files from which it was derived. For example, an alignment file should indicate which raw data files and reference indices were used to create it.

Missing possible controls
Missing possible controls
Mismatched control
Missing controlled_by

ChIP-seq, RAMPAGE, and CAGE experiments all require controls. A flag appears if the control is missing or has a different biosample type from the experiment it controls (e.g. K562 versus MCF-7). A flag will also appear if the control files are not matched to corresponding experimental files; this information is stored in a property called "controlled_by" in the experimental file object. Please note that the "missing possible controls" audit has different flag colors depending on the assay type and the project phase.
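The control checks described above can be sketched over plain dictionaries. The property names `possible_controls`, `controlled_by`, and `biosample_term_name` appear on this page, but the dict layout, the function name, and the example accession are assumptions for illustration; the portal's own audit logic may differ.

```python
def control_flags(experiment, files):
    """Return audit-style messages about an experiment's controls.

    `experiment` and each control are dicts with 'biosample_term_name';
    `experiment` may list control experiments under 'possible_controls'.
    Each entry in `files` may list control files under 'controlled_by'.
    """
    flags = []
    controls = experiment.get("possible_controls") or []
    if not controls:
        flags.append("Missing possible controls")
    elif any(c["biosample_term_name"] != experiment["biosample_term_name"]
             for c in controls):
        # Control was made from a different biosample, e.g. K562 versus MCF-7.
        flags.append("Mismatched control")

    # Every experimental file should point at the control files it uses.
    if any(not f.get("controlled_by") for f in files):
        flags.append("Missing controlled_by")
    return flags


experiment = {
    "biosample_term_name": "K562",
    "possible_controls": [{"biosample_term_name": "MCF-7"}],
}
files = [{"accession": "file-1", "controlled_by": []}]  # hypothetical file record
print(control_flags(experiment, files))
```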

Missing genotype
Missing external identifiers

Biosample donors for worms and flies (Caenorhabditis and Drosophila) must have their genotypes listed in accordance with the nomenclature rules (for fly, for worm), and must have external references (e.g. GEO/SAMN IDs) listed. 

Inconsistent ontology term

The ontology term does not match the ontology ID provided.

Inconsistent depleted_in_term length
Depleted_in length mismatch

Some tissue type was removed from the biosample before library creation. The list of ontology term names of the tissues removed does not match the list of the ontology term IDs in either the biosample or the library.

Inconsistent organism
Inconsistent donor
Inconsistent library biosample
Inconsistent age
Inconsistent sex

The biosamples used for each replicate within an experiment do not share the given properties.

Inconsistent paired_with

Two read pair files from paired-end sequencing are annotated as belonging to different experiment replicates.

Inconsistent target of control experiment

A control experiment does not have its target annotated as “control” in the metadata. Rather, the target is some transcription factor or chromatin modifier. 

Inconsistent control

The experiment and its control were not performed on the same type of biosample, e.g. same cell line or tissue type; the control file is of a different format than the experimental file being controlled, e.g. fastq vs. idat; or the control file being used is from a control experiment that is not listed in the possible_controls property of the experiment.

Inconsistent document_type

A document has been attached to a file, but does not describe the file format specifications for that file.

Inconsistent mutated_gene organism

The organism from which the biosample came does not match the organism of the mutated gene in the donor. Donor mutated_gene should be of the same species as the donor and biosample.

Invalid donor mutated_gene

Donor mutated genes should not be tags, controls, recombinant proteins, or modifications.

Invalid dates

The date that the cell culture was harvested precedes the date on which the culture was started.

Invalid possible_control

The experiment being used as a control is not designated as a control in the metadata.

Invalid depleted_in_term_id

Before sequencing the library, a specific type of nucleic acid (e.g. polyA RNA) was removed. The nucleic acid that was sequenced is listed as the same nucleic acid that was removed.

Unexpected step_run

The incorrect pipeline step was attached to the file in question, e.g. a peak calling step that outputs peak files was attached to an alignment file instead.

Dataset Consistency Audits

Audit Message

Explanation

Missing reference

Publication file sets should be linked to a specific publication.

Missing IHEC required assay

Reference Epigenome datasets must have at least one of each of the IHEC required assays.

Multiple donors in reference epigenome

A reference epigenome dataset has experiments conducted in biosamples from more than one donor.

Multiple biosample treatments in reference epigenome

A reference epigenome dataset should not have multiple kinds of treatments between experiments, even if the type of biosample used is the same.