Getting Started

Introduction

Welcome to the ENCODE Portal! The ENCODE Portal serves as the primary source of data generated by the ENCODE Consortium and of up-to-date information about the project, including data releases, publications, and upcoming tutorials. This site is developed and maintained by the Data Coordination Center (DCC). All data generated by the ENCODE Consortium are submitted to the DCC and reviewed for quality prior to release to the scientific community. No account is needed to view released data.

This document describes what data are available at the ENCODE Portal, ways to get started searching and downloading data, and an overview of how the metadata describing the assays and reagents are organized. ENCODE data can also be visualized and accessed from other resources, including the UCSC Genome Browser and Ensembl.

Please contact the ENCODE DCC via email (encode-help@lists.stanford.edu) or Twitter (@encodedcc) if you have any further questions.

Information available at the portal

The ENCODE Portal contains the following types of data generated by the ENCODE Consortium:

Additional information about the activities of the ENCODE Consortium is provided on the portal:

Using the portal

Data at the portal can be browsed and searched in the following ways: 

Once the assays of interest have been identified by browsing, searching, or programmatically accessing the portal, the files can be downloaded using wget or curl. See the "Download files" section below for more details.

Files can be visualized at the UCSC Genome Browser using the "Visualize Data" button.

Browse and filter data

Select the "Assays", "Biosamples", or "Antibodies" link located below the "Data" menu in the toolbar to browse that data type (Figure 1a). Each link lists all available data for that data type. The results can be narrowed and filtered by selecting one or more values in a metadata category on the left-hand side of the page (Figure 1b). For example, the "Assay" category lists all assays that have been used to generate ENCODE data. Multiple values from each category can be selected at the same time (Figure 1c). Objects can also be filtered by status, with "released" and "revoked" being the categories currently available to the public.

Figure 1. (A) Select "Assays", "Biosamples", or "Antibodies" from the Data menu item in the toolbar. (B) The metadata category on the left-hand side lists all available values to help filter the results. (C) Multiple values can be selected at any given time.

Search for data

The website can be searched by entering a search term in the search box located in the upper right-hand corner of the toolbar (Figure 2). This search box appears on every page. Example search terms include a biosample (e.g. "skin"), an assay name (e.g. "ChIP-seq"), or a protein target of an antibody (e.g. "CTCF"). The search results can be narrowed by data type (experiment, biosample, or antibody) and then further filtered using the metadata categories on the left-hand side (refer to the "Browse and filter data" section above; Figure 1b).

Figure 2. Enter a search term into the search box.

Programmatic access for bulk downloads

Note: We are currently developing a "shopping-cart" based method to bulk download data.

In addition to web-based browsing and searching, the ENCODE portal can be accessed programmatically via the REST API. Instructions on how to browse and search for ENCODE data programmatically are provided in the REST API help document.  In brief, all queries that can be performed via the web can be used as programmatic queries.  
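Because web queries and programmatic queries share the same form, a search can be expressed as a URL built in code. The following is a minimal Python sketch, not the portal's documented interface: the "/search/" path, the "searchTerm" parameter, and the "format=json" parameter are assumptions that should be checked against the REST API help document, and build_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

PORTAL = "https://www.encodeproject.org"


def build_search_url(search_term, **filters):
    """Build a search URL that asks the portal for JSON.

    Assumptions (verify against the REST API help document):
    the "/search/" path, the "searchTerm" parameter, and "format=json".
    """
    params = {"searchTerm": search_term, **filters, "format": "json"}
    return f"{PORTAL}/search/?{urlencode(params)}"


# Search for CTCF and narrow the results to experiments.
print(build_search_url("CTCF", type="experiment"))
```

The resulting URL can be opened in a browser or passed to any HTTP client.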

Once the JSON objects for the results are retrieved, the location of each file can be found in the href property of the file object (Figure 3). Prepend the URL https://www.encodeproject.org to obtain the full path of the file. See the "Download files" section for the commands used to download files.

Further information about what objects can be expected to be linked together is described below in the "Data model" section.

Figure 3. The highlighted field refers to the href property of the file object. This property contains the location of the file.
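The href step described above can be sketched in Python as follows. The sample file object is a trimmed, illustrative stand-in containing only the href property (real file objects carry many more properties), and file_url is a hypothetical helper name.

```python
import json

PORTAL = "https://www.encodeproject.org"


def file_url(file_object):
    """Return the full download URL by prepending the portal URL to href."""
    return PORTAL + file_object["href"]


# Illustrative file object reduced to the one property used here.
sample = json.loads(
    '{"href": "/files/ENCFF002CTW/@@download/ENCFF002CTW.bed.gz"}'
)
print(file_url(sample))
```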

Download files

Files are named by their accession and contain file format information. Links to download individual files are available at the bottom of each page that describes a single assay. Files can be downloaded directly from the web page, or the link can be copied and downloaded elsewhere.

Via the wget command:

 > wget https://www.encodeproject.org/files/ENCFF002CTW/@@download/ENCFF002CTW.bed.gz

Via the curl command:

 > curl -O -L https://www.encodeproject.org/files/ENCFF002CTW/@@download/ENCFF002CTW.bed.gz
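For scripted downloads, the same request can be made from Python's standard library. This is a sketch under one assumption: that every file follows the URL pattern of the single example above (accession, "@@download", then accession plus file extension). The helper names download_url and download are hypothetical.

```python
import shutil
from urllib.request import urlopen

PORTAL = "https://www.encodeproject.org"


def download_url(accession, extension):
    """Build a download URL from a file accession and its extension.

    The pattern is inferred from the single example URL in this section.
    """
    return f"{PORTAL}/files/{accession}/@@download/{accession}.{extension}"


def download(accession, extension, dest_dir="."):
    """Stream the file to dest_dir and return the local path."""
    dest = f"{dest_dir}/{accession}.{extension}"
    # urlopen follows HTTP redirects, similar to curl's -L option.
    with urlopen(download_url(accession, extension)) as response, \
            open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (performs a network request):
# download("ENCFF002CTW", "bed.gz")
```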

Visualize data

 

On every experiment page, a "Visualize Data" button under the Files section launches a Genome Browser view when there are data suitable for visualization. Files must be in bigBed or bigWig file format to be visualized as a track hub.

File Submission

For more information on submitting data files to the DCC, visit the File Submission page.

Data organization

Metadata

The DCC, in collaboration with the labs performing the assays and the Data Analysis Center (DAC), has defined a set of metadata that describe the experimental conditions used to generate the data, the processing steps performed to analyze and interpret the data, and the metrics used to evaluate the quality and reproducibility of the data. These metadata are displayed on the pages that describe the assays, biosamples, and antibodies.

Accessions

The ENCODE DCC creates accessions for metadata that can be reused in experimental protocols and computational analyses. This ensures that the exact assay or reagent is being referred to when assays are discussed or files are analyzed. The accessions are in the format ENC(SR|BS|DO|AB|LB|FF)[0-9]{3}[A-Z]{3}, where the two-letter code (SR, BS, DO, AB, LB, or FF) indicates the type of metadata being accessioned. Accessions are given to the following types of metadata:

  • An assay: Each assay is given an accession.  Typically, the replicates will be performed using the same method, performed on the same kind of biosample, and investigating the same target.  Assays may contain one or more biological replicates. A sample accession for an assay is ENCSR000DVI.
  • A biosample: An accessioned biosample refers to a tube or sample of that biological material that is being used.  For example, the following would all be given a biosample accession: (1) a batch of a cell line grown on a specific day, (2) the isolation of a primary cell culture on a specific day, or (3) the dissection of a tissue sample on a specific day. A sample accession for a biosample is ENCBS046RNA.
  • A strain or donor: Every strain (for model organisms) and donor (for humans) is given a donor accession.  This accession allows multiple samples obtained from a single donor to be grouped together. The donor information is listed with the biosample: ENCBS046RNA.
  • An antibody lot: Each unique antibody lot is accessioned so that assays can refer specifically to that antibody.  Each antibody lot is also associated with characterizations for its target in a species. 
  • A library: A unique library that can be resequenced is accessioned to ensure the correct files are associated with the nucleic acid material that has been created from the biosample.  The library accession and experimental details of how the library is constructed are displayed on the assay page:  ENCSR000DVI.
  • A file: Each data file is accessioned.  This accession is used as the file name, along with its file format as an extension.  The file accession is associated with the contents of that file.  When a new file is submitted to replace an existing file, the new file is given a new accession and related to the older file.  Files are displayed at the bottom of an assay page:  ENCSR000DVI.
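The accession format can be checked in code. The sketch below writes the format as a Python regular expression with a parenthesized alternation for the two-letter code. The mapping from code to metadata type is partly inferred: SR, BS, and FF are confirmed by the sample accessions in this document, while DO, AB, and LB are assumed from the order of the list above. The helper name accession_type is hypothetical.

```python
import re

# SR, BS, and FF match the sample accessions in this document;
# DO, AB, and LB are inferred from the order of the list above.
ACCESSION_TYPES = {
    "SR": "assay",
    "BS": "biosample",
    "DO": "strain or donor",
    "AB": "antibody lot",
    "LB": "library",
    "FF": "file",
}

ACCESSION_RE = re.compile(r"ENC(SR|BS|DO|AB|LB|FF)[0-9]{3}[A-Z]{3}")


def accession_type(accession):
    """Return the metadata type for a well-formed accession, else None."""
    match = ACCESSION_RE.fullmatch(accession)
    return ACCESSION_TYPES[match.group(1)] if match else None


for acc in ("ENCSR000DVI", "ENCBS046RNA", "ENCFF002CTW", "ENCXX000AAA"):
    print(acc, accession_type(acc))
```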

Ontologies

The DCC uses a set of controlled vocabularies and ontologies to maintain consistent use of language when describing the metadata for the experiment assays.  The use of consistent language is essential to ensure all the correct results are returned when browsing or searching the metadata.  In addition, the use of consistent language facilitates the integration of datasets from diverse projects.  To this end, the ENCODE DCC is using the following ontologies to capture specific metadata categories:

Data model

The metadata captured for the experimental assays and computational analyses are organized as objects that have defined relationships to each other.  In general, the data model is organized to ensure that the replicate structure of the assays is represented along with the reagents, such as biosamples, that were used in the assay (Figure 4). In addition, the assays are associated with the raw data generated from the assay and the processed data derived from these raw data. The assay accession serves to group all related replicates together.

Figure 4. Representation of the core of the ENCODE DCC data model.
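As a rough illustration of these relationships, the objects can be sketched as Python dataclasses. This is not the DCC schema: the class and field names are hypothetical, the links are reconstructed only from the descriptions in this document (an assay groups replicates and files, a library is made from a biosample, and a biosample is linked to a donor), and the donor and library accessions in the example are made-up placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Donor:
    accession: str


@dataclass
class Biosample:
    accession: str
    donor: Optional[Donor] = None


@dataclass
class Library:
    accession: str
    biosample: Biosample


@dataclass
class Replicate:
    biological_replicate_number: int
    library: Library


@dataclass
class File:
    accession: str
    file_format: str


@dataclass
class Assay:
    accession: str
    replicates: List[Replicate] = field(default_factory=list)
    files: List[File] = field(default_factory=list)


# ENCDO000AAA and ENCLB000AAA are placeholders, not real accessions.
donor = Donor("ENCDO000AAA")
biosample = Biosample("ENCBS046RNA", donor)
library = Library("ENCLB000AAA", biosample)
assay = Assay(
    "ENCSR000DVI",
    replicates=[Replicate(1, library)],
    files=[File("ENCFF002CTW", "bed")],
)
# Walk from the assay down to the donor of its first replicate.
print(assay.replicates[0].library.biosample.donor.accession)
```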

The ENCODE portal provides formatted views of each data object, known as a profile page.  Profile pages for the metadata model depicted in Figure 4 include the following:

The entire data model is available at the ENCODE DCC GitHub schema repository and is visualized as an SVG.  Replace the object name in the profile URL to view the formatted schema for that object.