What is ENCODE?
The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control the cells and circumstances in which a gene is active.
This website is the ENCODE Portal, which hosts the data produced by members of the ENCODE Consortium and also provides access to this data to the wider scientific community.
Finding, downloading, and using data
How do I find data?
How can I find an experiment by transcription factor or by target?
You can find experiments targeting specific transcription factors and other targets by filtering experiments on the Experiment search page. Enter the name of the target in the search box above the "Target of assay" facet or scroll through the list. Click the name of the desired target to select experiments with that target. A tutorial illustrating this process is available here (link opens in new tab).
Another method is to construct a URL query that filters on the target.label property of experiments. An example of such a query is:

https://www.encodeproject.org/search/?type=Experiment&target.label=H3K4me3

This would filter results to only experiments targeting H3K4me3.
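Such a query can also be assembled programmatically. The sketch below uses Python's standard library to build a search URL from filter parameters; the parameter names mirror the facet filtering described above, and the optional "format": "json" parameter asks the Portal for machine-readable results:

```python
from urllib.parse import urlencode

# Build an ENCODE search URL that filters experiments by target label.
base = "https://www.encodeproject.org/search/"
params = {
    "type": "Experiment",
    "target.label": "H3K4me3",  # the target to filter on
    "format": "json",           # request JSON instead of an HTML page
}
query_url = f"{base}?{urlencode(params)}"
print(query_url)
# https://www.encodeproject.org/search/?type=Experiment&target.label=H3K4me3&format=json
```

Swapping the value of target.label, or adding other property=value pairs, adjusts the filter in the same way the facets do.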
How do I download files?
There are multiple ways to download data. Programming skills are not required to download files. When viewing the Portal in a browser, one can download files:
- From an individual experiment's page: Files in a dataset are listed in the "Files" section of an experiment's page. Beside each file's accession, which is a link to the file's individual page, is a clickable download icon:
- From a file's individual page: A download button is displayed on each file's page at the bottom of the "Summary" section.
There are also batch download options for more efficiently downloading large numbers of files. Users can first download a text list of metadata and file links. These links can then be downloaded from the command line with a tool such as curl, which will download the files into your working directory. More guidance on this topic is available on the REST API page, and general information is available on the Batch Download page. Additionally, a tutorial demonstrating batch download is available here (link opens in new tab). Batch download can be accessed:
- From a search result page: When viewing search results in List view, a batch "Download" button is shown above the list to allow users to download every dataset in the results.
- From the cart: Users can put datasets in their cart, and then view their collected datasets from the Cart manager. A batch download button is available from the Cart manager page. For details on how to use the Cart, please visit the Cart page, or start the Cart Basics tutorial (link opens in new tab).
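The links in "files.txt" can also be fetched with a short script rather than curl. The sketch below is a minimal, hypothetical downloader: it assumes each non-empty line of the list is a direct download URL (including the "metadata.tsv" link on the first line) and derives each local filename from the last path component of the URL:

```python
import os
from urllib.request import urlretrieve

def filename_from_url(url):
    """Local filename for a download URL: its last path component."""
    return url.rstrip("/").rsplit("/", 1)[-1]

def download_all(list_path, dest_dir="."):
    """Fetch every URL listed in a files.txt-style list into dest_dir."""
    saved = []
    with open(list_path) as fh:
        for line in fh:
            url = line.strip()
            if not url:
                continue
            target = os.path.join(dest_dir, filename_from_url(url))
            urlretrieve(url, target)  # one network fetch per line
            saved.append(target)
    return saved
```

For very large batches, curl or a download manager with retry support may be more robust than this minimal loop.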
How can I download subsets of files from experiments? How can I get more information on files in “files.txt”?
Your first approach should generally be to use facet searching to get the most specific possible grouping of desired experiments. More details on how to facet search are available here, as well as in this tutorial (link opens in new tab). From an individual experiment page, you can download specific files by using the download links in the Files table, or the download link on each file page.
Downloading from multiple experiments (i.e. a batch download) can be done from either the experiment search page or the cart. Guidance on using the cart can be found on the Cart page. From either page, clicking the “Download” button above the list of experiments downloads a file named “files.txt”. This file contains a list of URLs for every file linked to the selected experiments. The first line of “files.txt” contains a URL for a separate file named “metadata.tsv”. Copying this URL to your browser window, or fetching it from the command line using curl, downloads a tab-delimited table of metadata that includes properties which are not visible as facets on the Experiment search page, as well as a direct download link. Each file listed in “files.txt” has its own row in “metadata.tsv”. By examining “metadata.tsv” in a spreadsheet program, you can further filter the list of files by these additional metadata properties and determine which ones you want to download.
For example, if you only wanted to download raw data, you could filter the table to leave only files with “output type” of “reads”, then download the filtered files from the links specified in the "File download URL" column.
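This filtering step can also be scripted. The sketch below reads "metadata.tsv" with Python's csv module and keeps the download links for raw reads; it assumes the column headers "Output type" and "File download URL" as they appear in the table:

```python
import csv

def reads_urls(metadata_path):
    """Return the download URLs of rows whose output type is 'reads'."""
    urls = []
    with open(metadata_path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row.get("Output type") == "reads":
                urls.append(row["File download URL"])
    return urls
```

The same pattern works for any other column in the table, such as filtering by file format or assembly.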
How can I download data directly from S3?
Every file object that has a file available to download also has a metadata property called s3_uri, which contains the full S3 path for the file. A direct http link is also stored in the url subproperty of the cloud_metadata property. Both can be found by clicking the JSON button in the upper right corner of a file page to view the record in JSON format; viewing ENCFF824ZKD this way, for example, shows both its s3_uri and its cloud_metadata.url.
To learn more about fetching these links programmatically, which is useful when downloading many files, we recommend reading our REST API help page.
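As a sketch of that programmatic route: a file's JSON record can be fetched from the Portal's /files/ endpoint with ?format=json, and the two links read out of it. The helper names below are our own, and the record passed to download_links can come from any source, including a cached copy:

```python
import json
from urllib.request import Request, urlopen

def fetch_file_record(accession):
    """Fetch a file object's JSON record from the ENCODE Portal."""
    url = f"https://www.encodeproject.org/files/{accession}/?format=json"
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req) as resp:  # network fetch
        return json.load(resp)

def download_links(record):
    """Return (S3 path, direct http link) from a file's JSON record."""
    return record["s3_uri"], record["cloud_metadata"]["url"]
```

For example, download_links(fetch_file_record("ENCFF824ZKD")) would return the same two links shown on that file's JSON page.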
Why can't I download file data?
You may be attempting to download restricted data files. The ENCODE Portal hosts experiments from the Roadmap Epigenomics Project, and for some of these experiments ENCODE does not have consent to share the raw sequencing data. A complete list of these experiments is found at this link. For these experiments, we are able to freely provide the processed data files; to obtain the raw data, it is necessary to submit a Data Access Request for the Costello and Broad projects at dbGaP.
The ENCODE Portal occasionally experiences server issues leading to slowed or stopped downloads. In these situations, please notify our help desk so that we can attend to the issue.
Can I use this data for my research?
Users are allowed to freely download, analyze, and publish results based on ENCODE data without restriction, provided they cite the data in accordance with our citation guidelines.
How should I cite ENCODE data?
Please visit our citation guidelines page for details on citing ENCODE data.
Interpreting ENCODE data
How do I know if this data is good?
Whether a dataset is “good” depends on the criteria you measure it against, so instead we provide several methods to help you gauge whether a dataset is suitable for your use:
- Statuses: In general, "released" data is more likely to be of higher quality. Datasets are typically "archived" because we have more current data to replace them, or are "revoked" if they are later found to fall below data quality standards or have other serious errors. See "What does it mean for an experiment or file to be released/archived/revoked?"
- Audits: All datasets available on the Portal are checked and flagged if they fail an audit. These audits fall in several different categories, and can be read about in more detail on the Audit page. We recommend viewing the audit details and determining whether the dataset is still acceptable based on your own use case. See "Is this data OK to use, even though there is an audit flag on it?" and the tutorial demonstrating how to view audits (link opens in new tab).
- QC metrics: Processed data generated from ENCODE uniform processing pipelines also have associated quality metric information. To view these, visit an experiment page, scroll to the Association Graph, and click on the small, light green bubbles in each file block. The metrics are also displayed on the individual file page. A list of terms relevant to understanding the quality metrics is available here.
What does it mean for an experiment or file to be released/archived/revoked?
All released experiments and their associated files have been approved by the submitting lab and the ENCODE DCC after being reviewed for data quality and metadata accuracy, and are considered the most current data available. In some cases, a previously released experiment is reclassified as "archived" or "revoked" and is no longer considered current. Experiments can be revoked if errors in the data were discovered after release, or if the assay quality metrics fell significantly below an updated set of data standards. In contrast, archived experiments do not have serious errors, but may be superseded by a more recent, released experiment with higher-quality data. When a superseding experiment exists, "revoked" and "archived" experiments link to it from the individual experiment page.
For an in-depth explanation of object statuses, please refer to the Statuses page.
Is this data OK to use, even though there is an audit flag on it?
This depends on how you intend to use the data. Audits are split into three tiers by severity: red is the most severe, followed by orange and then yellow. These are further divided by category; for example, different audits exist for read coverage issues, library complexity issues, and other types of issues. Audit information can be found on the search page and on each individual experiment page. The audit button is visible for each experiment on the right side of the search page, under the accession number and status. While viewing an experiment page, click the audit button, and then click the plus (+) symbol to view a detailed description of what is causing the audit.
The Audit page contains a detailed description of each audit that can help you decide if the experiment is still acceptable for your purposes.
How can I interpret this data? How was this data obtained? How was the assay performed?
More detailed information about the experiments and their data is available in the following sections.
The Documents section located at the bottom of every experiment page contains the wet lab protocols and computational methods provided by the ENCODE mapping centers.
Most assays have standards and quality metrics associated with them. Please see the Assays and Standards page for details.
For processed data, the Files section contains an Association Graph on each experiment page. The graph depicts the relation of raw and processed files to each other. Files, represented in yellow boxes with accession and file type, are linked by blue colored boxes that represent steps in a computational pipeline. Each blue step box is labeled with a basic description of the analysis performed. Clicking on a yellow box brings up a pop-up with information about how that particular file was generated and a link to download the file. Clicking on a blue box brings up a pop-up with information about that processing step and a link to more information about the pipeline as a whole. If the pipeline is an ENCODE uniform processing pipeline, the link will further link to the GitHub repository for the pipeline as well as a page outlining the data standards for the assay type.
If this information does not fully answer your questions, please contact our help desk and we will respond to the best of our ability and/or get in contact with members of the mapping centers that submitted those data.
Are these experiments comparable?
When deciding if data produced by different experiments can be compared, we recommend examining the biosample, assay, and processing information:
- Some users are interested in results from different assays on the same biosample type. Biosample information can be viewed by clicking the accessions of the biosamples listed in the replicates table of an experiment page, or embedded in an experiment object if fetched programmatically in JSON format. Biosamples contain not only the term name (e.g. K562), but many other properties, including treatments, genetic modifications, subcellular fraction, or phase. A full list of tracked properties can be viewed on the schema page for Biosample objects.
- Assay information can be found on the Assays and Standards page, as well as in protocol documents attached to the experiments. See "How was the assay performed?"
- Was the data processed and analyzed in a comparable way? Data should be processed with the same methods; uniformly processed data can be more easily used for integrative analysis. For this reason, we use the ENCODE uniform processing pipelines whenever possible. These pipelines are freely available on the DCC GitHub for anyone to use. To indicate that a processed file was produced using a uniform pipeline, the lab property of the file is designated as "ENCODE Processing Pipeline." This information is displayed on each file's own page, as well as on the experiment page. More information about the processing pipelines is available on the Pipelines page. If the processed data is not suitable for comparison, raw data is available to download and process uniformly.
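When comparing processed files programmatically, that lab designation can serve as the filter. The sketch below assumes file records fetched as JSON with the lab embedded as an object carrying a title field (in some responses the lab may instead appear as an identifier path, in which case the comparison would need adjusting):

```python
def uniformly_processed(files):
    """Keep only files whose lab title marks them as output of the
    ENCODE uniform processing pipelines."""
    return [
        f for f in files
        if f.get("lab", {}).get("title") == "ENCODE Processing Pipeline"
    ]
```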
Is there something wrong with this dataset or with the metadata?
Please email our help desk about any concerns you have with any object or its metadata, noting the accession of the problematic object and describing your concerns.
How do I submit data to the Portal?
Please see our step-by-step instructions here. Note: if you aren’t a member of the ENCODE Consortium but are still interested in submitting your data, please contact us at firstname.lastname@example.org.
I have more questions. Where can I get additional help?
If you have any questions, concerns, or comments, please feel free to email our help desk at email@example.com.