The ENCODE REST API

Introduction

Direct interaction with the ENCODE DCC metadata database is typically done through scripts that you write and run on your own Mac, PC or server. Your scripts interact with the database through an industry-standard RESTful (Representational State Transfer) application programming interface (API) built on the Hypertext Transfer Protocol (HTTP). Data objects exchanged with the server conform to the standard JavaScript Object Notation (JSON) format, a data interchange format that is used across the web. If you’re familiar with JavaScript’s object literal notation or Python’s dictionary datatype, then JSON-formatted text will look familiar. You will almost certainly use libraries for your language of choice to handle the network connection and parse the objects returned (requests and json for Python, for example). We have also written example scripts that you can consult.
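As a minimal sketch of how JSON text maps onto a Python dictionary, the snippet below parses a short JSON string with the standard json module. The string is an abbreviated stand-in built from a few biosample fields, not a complete server response.

```python
import json

# JSON text as it might arrive from a server
# (abbreviated stand-in built from a few biosample fields)
json_text = '{"accession": "ENCBS000AAA", "aliases": ["richard-myers:MCF7-003"], "rnais": []}'

# Parse the JSON text into native Python objects
record = json.loads(json_text)

print(type(record).__name__)   # a JSON object becomes a dict
print(record["accession"])     # values are reached by key
print(record["aliases"][0])    # a JSON array becomes a list
```

JSON objects become dicts, arrays become lists, and strings and numbers become their Python equivalents.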

Quickstart examples

Getting objects from the database, updating existing objects, or adding new objects can often be done in just a few lines of code. Before you write any, here are some quick ways of testing your connection and exploring the ENCODE data objects.

Prerequisites

  1. A JSON pretty-printer plugin for your web browser, such as JSONView (for Chrome or Firefox) or JSON Formatter (for Safari)
  2. The curl command, which ships with Mac OS X and all major Linux/UNIX distributions. If you don’t have curl, you can download an executable from the curl project website; make sure you get the SSL- and SSH-aware version. You will find curl useful later for debugging your scripts, so it is worth installing now if you don’t already have it.

ENCODE JSON in your browser

Look at biosample ENCBS000AAA at https://www.encodeproject.org/biosamples/ENCBS000AAA/.

If you add the parameter ?format=json to the end of the URL (https://www.encodeproject.org/biosamples/ENCBS000AAA/?format=json), you should see the same record as JSON.

By adding the ?format=json parameter, you've instructed the ENCODE web server to send you the raw contents of the database record for ENCBS000AAA in JSON format. Instead of a rendered page, you see the native JSON object as it is stored in the ENCODE database. This is a biosample object, and you will see all of its properties and their values. You need a JSON pretty-printer (see the prerequisites above) to make sense of the JSON. Notice that many of the object’s properties contain other (nested) objects.

Notice also that the object has its own URL, and that the URL contains the object’s accession number. This pattern always works to access ENCODE objects. In fact, because accession numbers are unique, you can construct a URL from the accession number alone and the server will return your object, for example https://www.encodeproject.org/ENCBS000AAA
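The URL patterns described so far can be sketched as two small helpers. The function names object_url and as_json are hypothetical and chosen only for illustration; the URLs they produce are the ones quoted in this section.

```python
SERVER = "https://www.encodeproject.org"


def object_url(accession, collection=None):
    """Build an object URL from an accession number.

    Without a collection name the short form is returned, which relies on
    accession numbers being unique across the database.
    """
    if collection is None:
        return "{}/{}".format(SERVER, accession)
    return "{}/{}/{}/".format(SERVER, collection, accession)


def as_json(url):
    """Append the format=json parameter to a URL."""
    separator = "&" if "?" in url else "?"
    return url + separator + "format=json"


print(object_url("ENCBS000AAA"))
print(object_url("ENCBS000AAA", collection="biosamples"))
print(as_json(object_url("ENCBS000AAA", collection="biosamples")))
```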

As you learn more about the ENCODE REST API, you’ll see that when you retrieve objects (using the HTTP “GET” method) from the server, they always come back as JSON. When you create new objects (using the HTTP “POST” method) or update existing objects (using the HTTP “PATCH” method), you always format a JSON object on your side and send that to the server.

A simple command-line GET

Try this in a terminal window:

$ curl -H "Accept: application/json" https://www.encodeproject.org/biosamples/ENCBS000AAA/

You should see lots of text that starts with something like "{"system_slims": [], "accession": "ENCBS000AAA", ...".

Your command instructs curl to download the database record for the ENCODE biosample ENCBS000AAA in JSON format (it's the same object you viewed in the browser above at https://www.encodeproject.org/biosamples/ENCBS000AAA/). Because the request carries the Accept: application/json header rather than coming from a web browser running the ENCODE web app, the server returns the object in JSON format. All curl is really doing is downloading the "web page" that corresponds to that ENCODE record. Transport is by HTTP, just as with a web browser.

This sort of JSON-formatted text is sometimes referred to as a "JSON document". You can redirect the output to a file:

$ curl -L -H "Accept: application/json" https://www.encodeproject.org/biosamples/ENCBS000AAA/ > ENCBS000AAA.json
$ more ENCBS000AAA.json

In this example, the -L option tells curl to follow redirects. It is not needed here, but it is necessary for identifiers, such as aliases, that get redirected to another resource. In such cases, without -L you may see something like:

{"status": "error", "code": 301, "description": "Moved Permanently" ... }

If in doubt, use "curl -L".

If you supply an invalid identifier (like an accession number that doesn't refer to anything), you'll see something like:

{"status": "error", "code": 404, "description": "The resource could not be found." ... }

You can see that even error responses come back in JSON format. This makes parsing the results of your queries much easier.
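Because errors arrive as JSON too, a script can test for them with ordinary dictionary access. The following is a minimal sketch that assumes only the three fields visible in the error examples above (status, code, description); the helper name error_message is hypothetical.

```python
import json

# Error body shaped like the 404 example above (only the fields shown there)
error_text = '{"status": "error", "code": 404, "description": "The resource could not be found."}'
# Abbreviated success body; note that regular objects also carry a "status"
ok_text = '{"accession": "ENCBS000AAA", "status": "CURRENT"}'


def error_message(payload):
    """Return a short message if payload is an error response, else None."""
    if payload.get("status") == "error":
        return "{}: {}".format(payload.get("code"), payload.get("description"))
    return None


print(error_message(json.loads(error_text)))
print(error_message(json.loads(ok_text)))
```

Checking the value of status, rather than its mere presence, matters because ordinary objects have a status property of their own.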

REST, JSON and curl in perspective

As you develop your scripts, curl can be a valuable testing tool. If you can get an object with a curl command, you should be able to get it with your script. You can use curl to push new objects to the database and modify existing ones, too, but in a production environment you will want the control of a script over the one-off convenience of command line curl.

The important thing to remember is that each ENCODE object is accessible via a simple URL containing an identifier (like an accession number) for that object. That's all curl does, that's all your browser does, and that's all your scripts will do. The resource you GET, using that URL, will be in JSON format. You will then use a library to parse the JSON and construct a native data structure (like a Python dictionary). Similarly, to submit new metadata (or update existing metadata), you will make JSON in your script (usually from a native data structure like a Python dictionary) and then use HTTP methods to POST or PATCH to a URL. For more help with submitting files with your metadata, go to File Submission.
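The "make JSON in your script" step can be sketched as follows. The dict below is a hypothetical update whose property name and value are placeholders, and the request itself is deliberately not sent; only the serialization is shown.

```python
import json

# Hypothetical update: property name and value are placeholders
changes = {"description": "mammary gland, adenocarcinoma"}

# Serialize the dict to JSON text, ready to be used as a request body
body = json.dumps(changes)
print(body)

# Parsing the text again recovers an equal dict
print(json.loads(body) == changes)
```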

The ENCODE REST API

The ENCODE REST API uses HTTP to transport JSON-formatted information between the server and your scripts. The examples below use the GET method.

Prerequisites

  1. A library or module for your language of choice that supports HTTP. The "requests" library for Python is a good choice and is used for the examples in this documentation.
  2. A library or module for parsing JSON-structured text and building native data structures. The Python "json" library is a good choice.

GET

The HTTP GET request is used to retrieve objects from the ENCODE server. GET works without a username and password for publicly released ENCODE objects.

#!/usr/bin/env python2
# -*- coding: latin-1 -*-
'''GET an object from an ENCODE server'''
 
import requests, json
 
# Force return from the server in JSON format
HEADERS = {'accept': 'application/json'}
 
# This URL locates the ENCODE biosample with accession number ENCBS000AAA
URL = "https://www.encodeproject.org/biosamples/ENCBS000AAA/?frame=object"
 
# GET the object
response = requests.get(URL, headers=HEADERS)
 
# Extract the JSON response as a python dict
response_json_dict = response.json()
 
# Print the object
print(json.dumps(response_json_dict, indent=4, separators=(',', ': ')))

This script GETs the JSON representation of ENCODE biosample ENCBS000AAA. The ENCODE server's JSON response is parsed into a Python dict by the response.json() method. The dict is then serialized back to a string and pretty-printed with indentation.

The output should look something like:

{
    "system_slims": [],
    "accession": "ENCBS000AAA",
    "passage_number": 5,
    "alternate_accessions": [],
    "culture_harvest_date": "2012-04-10",
    "aliases": [
        "richard-myers:MCF7-003"
    ],
    "rnais": [],
    "submitted_by": "/users/df9f3c8e-b819-4885-8f16-08f6ef0001e8/",
    "dbxrefs": [
        "UCSC-ENCODE-cv:MCF-7"
    ],
    "uuid": "56e94f2b-25ac-4c58-9828-f63b66220999",
    "biosample_type": "immortalized cell line",
    "schema_version": "4",
    "note": "(PMID: 4357757)",
    "source": "/sources/atcc/",
    "developmental_slims": [],
    "protocol_documents": [
        "/documents/984071d4-9149-476a-b353-93592c6f48f3/"
    ],
    "pooled_from": [],
    "status": "CURRENT",
    "description": "mammary gland, adenocarcinoma",
    "age_units": "year",
    "life_stage": "adult",
    "constructs": [],
    "treatments": [],
    "lab": "/labs/richard-myers/",
    "award": "/awards/U54HG004576/",
    "donor": "/human-donors/ENCDO000AAE/",
    "@id": "/biosamples/ENCBS000AAA/",
    "culture_start_date": "2012-03-16",
    "product_id": "HTB-22",
    "characterizations": [],
    "url": "http://www.atcc.org/Products/All/HTB-22.aspx",
    "biosample_term_id": "EFO:0001203",
    "age": "69",
    "organ_slims": [],
    "biosample_term_name": "MCF-7",
    "health_status": "unknown",
    "date_created": "2013-12-12T05:50:02.101495+00:00",
    "organism": "/organisms/human/",
    "@type": [
        "biosample",
        "item"
    ]
}

The important point is that the response has been converted into a native Python dict data structure. For example, protocol_documents is a list of document identifiers, so you can loop over it.

Adding the following code to the example script above loops through the protocol_documents list, GETs each document object, and prints its description:

biosample = response_json_dict

# Each identifier is a path that already starts with "/"
for doc_URI in biosample['protocol_documents']:
    doc_response = requests.get('https://www.encodeproject.org' + doc_URI, headers=HEADERS)
    document = doc_response.json()
    print(document['description'])

This prints:

MCF-7 Cell Culture and 4-hydroxytamoxifen treatment
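The loop above builds each document URL by joining the server address and the identifier. As a minimal sketch of a more defensive way to do that, urljoin from Python's standard urllib.parse module resolves a path against the server address without producing doubled or missing slashes. The helper name absolute_url is hypothetical.

```python
from urllib.parse import urljoin

# Server address used throughout this documentation
SERVER = "https://www.encodeproject.org/"


def absolute_url(path):
    """Turn an object path such as /documents/<uuid>/ into a full URL."""
    return urljoin(SERVER, path)


# Path taken from the protocol_documents list in the output above
doc_path = "/documents/984071d4-9149-476a-b353-93592c6f48f3/"
print(absolute_url(doc_path))
```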

Programmatic search

In the GET example above, we used the ENCODE REST API to retrieve an individual object by its ENCODE accession number. You can also search for objects programmatically and get the search results back in JSON format. The ENCODE web app is a good place to see an example: enter the string "bone chip" in the search box at the top right of the page at https://www.encodeproject.org/. The results are those ENCODE objects that match that string (ChIP experiments having something to do with bone, in this case).

Search URL format

The same URL that returns the search results to the web app can be used in a script. This script does the same search and returns the results in JSON format:

#!/usr/bin/env python2
# -*- coding: latin-1 -*-
'''GET the results of a search from an ENCODE server'''

import requests, json

# Force return from the server in JSON format
HEADERS = {'accept': 'application/json'}

# This searches the ENCODE database for the phrase "bone chip"
URL = "https://www.encodeproject.org/search/?searchTerm=bone+chip"

# GET the search result
response = requests.get(URL, headers=HEADERS)

# Extract the JSON response as a python dict
response_json_dict = response.json()

# Print the object
print(json.dumps(response_json_dict, indent=4, separators=(',', ': ')))

The output of the script should start with something like this:

{
    "title": "Search",
    "notification": "Success",
    "@graph": [
        {
            "status": "CURRENT",
            "lab.title": "Bing Ren, UCSD",
            "description": "CTCF ChIP-seq on 8-week mouse bone marrow",
            "assay_term_name": "ChIP-seq",
            "accession": "ENCSR000CBL",
            "biosample_term_name": "bone marrow",
            "biosample_type": "tissue",
            "dataset_type": "experiment",
            "target.organism.name": "mouse",
            "target.label": "CTCF",
            "award.project": "ENCODE",
            "@id": "/experiments/ENCSR000CBL/",
            "@type": [
                "experiment",
                "dataset",
                "item"
            ]
        },
        {
            "status": "CURRENT",
            "lab.title": "Bing Ren, UCSD",
            "description": "POLR2A ChIP-seq on 8-week mouse bone marrow",
            "assay_term_name": "ChIP-seq",
            "accession": "ENCSR000CBM",
            "biosample_term_name": "bone marrow",
            "biosample_type": "tissue",
            "dataset_type": "experiment",
            "target.organism.name": "mouse",
            "target.label": "POLR2A",
            "award.project": "ENCODE",
            "@id": "/experiments/ENCSR000CBM/",
            "@type": [
                "experiment",
                "dataset",
                "item"
            ]
        },
...

Unlike the single-object response shown earlier, a search response wraps its hits in the @graph property. @graph is a list, with each object that satisfies the search condition represented as one entry.
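Walking the @graph list can be sketched as below. To keep the snippet runnable offline, search_result is an abbreviated copy of the response shown above rather than a live query, and treating the first @type entry as the most specific type is an assumption based on that output.

```python
# Abbreviated copy of the search response shown above (two hits, few properties)
search_result = {
    "title": "Search",
    "notification": "Success",
    "@graph": [
        {
            "accession": "ENCSR000CBL",
            "target.label": "CTCF",
            "@id": "/experiments/ENCSR000CBL/",
            "@type": ["experiment", "dataset", "item"],
        },
        {
            "accession": "ENCSR000CBM",
            "target.label": "POLR2A",
            "@id": "/experiments/ENCSR000CBM/",
            "@type": ["experiment", "dataset", "item"],
        },
    ],
}

# Each entry of @graph is one search hit
accessions = [hit["accession"] for hit in search_result["@graph"]]
print(accessions)

for hit in search_result["@graph"]:
    # Assumption: the first @type entry is the most specific type
    print(hit["@id"], hit["@type"][0], hit.get("target.label", ""))
```

In a real script, search_result would be the dict returned by response.json() in the search example.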

Notice that the experiment objects returned contain only a subset of the properties. These are the properties that the server returns to the web app to render the search results page. To get the full objects back, add &frame=object to the query. The query looks like:

# This searches the ENCODE database for the phrase "bone chip"
URL = "https://www.encodeproject.org/search/?searchTerm=bone+chip&frame=object&format=json"

This returns the full objects, with all their properties and values.

{
    "title": "Search",
    "notification": "Success",
    "@graph": [
        {
            "files": [
                "/files/ENCFF001LFR/",
                "/files/ENCFF001LFT/",
                "/files/ENCFF001LFU/",
                "/files/ENCFF001LFX/",
                "/files/ENCFF001LGH/",
                "/files/ENCFF001LGI/",
                "/files/ENCFF001XZU/"
            ],
            "system_slims": [
                "skeletal system",
                "immune system"
            ],
            "possible_controls": [
                "/experiments/ENCSR000CAS/"
            ],
            "original_files": [
                "/files/ENCFF001LFR/",
                "/files/ENCFF001LFT/",
                "/files/ENCFF001LFU/",
                "/files/ENCFF001LFX/",
                "/files/ENCFF001LGH/",
                "/files/ENCFF001LGI/",
                "/files/ENCFF001XZU/"
            ],
            "accession": "ENCSR000CBL",
            "replicates": [
                "/replicates/7c570f54-519d-46b6-a418-5cff7291f920/",
                "/replicates/bd65bc6e-50e6-4a40-90e8-79b312274267/"
            ],
            "references": [],
            "alternate_accessions": [],
            "aliases": [],
            "submitted_by": "/users/4f6e1132-f893-4011-8197-848187303a10/",
            "documents": [],
            "uuid": "522579b3-86a3-4f04-90d8-efa87ee9f84a",
            "biosample_type": "tissue",
...

Embedded objects can be expanded in the search results with &frame=embedded. With the search also restricted to experiments by &type=experiment, the query looks like:

# This searches the ENCODE database for the phrase "bone chip"
URL = "https://www.encodeproject.org/search/?searchTerm=bone+chip&frame=embedded&format=json&type=experiment"

The output should start with something like:

{
    "title": "Search",
    "notification": "Success",
    "@graph": [
        {
            "files": [
                {
                    "status": "CURRENT",
                    "submitted_file_name": "mm9/wgEncodeLicrTfbs/wgEncodeLicrTfbsBmarrowCtcfMAdult8wksC57bl6StdAlnRep1.bam",
                    "assembly": "mm9",
                    "submitted_by": {
                        "title": "Lee Edsall",
                        "@id": "/users/4f6e1132-f893-4011-8197-848187303a10/",
                        "uuid": "4f6e1132-f893-4011-8197-848187303a10",
                        "lab": "/labs/bing-ren/",
                        "@type": [
                            "user",
                            "item"
                        ]
                    },
                    "file_format": "bam",
                    "md5sum": "19a4854564b4635c461d2cc16ca10647",
                    "accession": "ENCFF001LFR",
                    "schema_version": "1",
                    "dataset": "/experiments/ENCSR000CBL/",
                    "download_path": "2013/4/18/ENCFF001LFR.bam",
                    "replicate": {
                        "status": "CURRENT",
                        "submitted_by": "/users/4f6e1132-f893-4011-8197-848187303a10/",
                        "biological_replicate_number": 1,
                        "uuid": "7c570f54-519d-46b6-a418-5cff7291f920",
                        "antibody": "/antibody-lots/ENCAB210NHK/",
                        "flowcell_details": [],
                        "technical_replicate_number": 1,
...

Additional search examples

  1. Every object that matches the string “CTCF”:
    https://www.encodeproject.org/search/?searchTerm=CTCF&format=json&frame=object
  2. A file with a particular MD5 7b9f8ccd15fea0bda867e042db2b6f5a:
    https://www.encodeproject.org/search/?type=file&md5sum=7b9f8ccd15fea0bda867e042db2b6f5a&format=json
  3. All the file objects from a particular experiment ENCSR000AKS (with links to embedded objects):
    https://www.encodeproject.org/search/?type=file&dataset=/experiments/ENCSR000AKS/&format=json&frame=object&limit=all
  4. All the fastq files from a particular experiment ENCSR000AKS:
    https://www.encodeproject.org/search/?type=file&dataset=/experiments/ENCSR000AKS/&file_format=fastq&format=json&frame=object&limit=all
  5. All the replicates for a particular experiment ENCSR000AKS:
    https://www.encodeproject.org/search/?type=replicate&experiment.accession=ENCSR000AKS&format=json&limit=all
  6. The same query as above but with embedded objects expanded:
    https://www.encodeproject.org/search/?type=replicate&experiment.accession=ENCSR000AKS&format=json&frame=embedded&limit=all
  7. All biosamples (abbreviated metadata):
    https://www.encodeproject.org/search/?type=biosample&limit=all
  8. All biosamples (full metadata with object references):
    https://www.encodeproject.org/search/?type=biosample&frame=object&limit=all&format=json
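Search URLs like the ones listed above can also be assembled from a dictionary of parameters instead of being typed by hand. This is a minimal sketch using urlencode from Python's standard library; the helper name build_search_url is hypothetical, and it relies on dictionaries keeping insertion order (Python 3.7 or later).

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.encodeproject.org/search/"


def build_search_url(params):
    """Return a search URL for a dict of query parameters.

    safe="/" leaves slashes unescaped so that object paths such as
    /experiments/ENCSR000AKS/ appear exactly as in the examples above.
    """
    return SEARCH_URL + "?" + urlencode(params, safe="/")


# Example 4 above: all the fastq files from experiment ENCSR000AKS
fastq_url = build_search_url({
    "type": "file",
    "dataset": "/experiments/ENCSR000AKS/",
    "file_format": "fastq",
    "format": "json",
    "frame": "object",
    "limit": "all",
})
print(fastq_url)

# Free-text search: the space in "bone chip" is encoded as "+"
bone_url = build_search_url({"searchTerm": "bone chip"})
print(bone_url)
```

Building the query from a dict makes it easy to switch frame or limit without editing the URL string.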

Summary of search features

  • If you do not specify a frame parameter, you will get a subset of object properties as defined by the web app. If what you need is in that subset (the accession, for example), this is a relatively fast way of getting back a long list of objects.
  • If you specify frame=embedded you will get all the object properties, with selected embedded objects expanded.
  • If you specify frame=object you will get all the object properties, with embedded objects referred to by an identifier.
  • frame=object is consistent and faster, because it always returns all of the properties with embedded objects as identifiers. Which embedded objects frame=embedded expands, and how deeply, is chosen to support search and may change as our search requirements change. For the most robust code, you might choose to use frame=object and then GET only the embedded objects you need.
  • You do not need to specify frame=embedded in order to search within embedded objects. Example 5 above, for instance, filters on experiment.accession, a property of the embedded experiment object, without specifying frame=embedded. A search term such as “award.rfa” likewise searches within the embedded award object and works fine with frame=object.
  • All of the results from search can be visualized within the web app by opening the URL in your browser, omitting format=json. Not all objects have fancy rendered collections or pages, but you will always get back a clickable list.
  • limit=all is not strictly necessary, but without it your result will contain only the first 25 hits (or all of them if there are fewer than 25). We plan to implement a mechanism to return very large searches in batches, but for now limit=all can generate large result sets.
  • Giving searchTerm=somestring will often return multiple object types. You can look into the @type property of each returned object for its data type.
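The suggestion to use frame=object and then GET only the embedded objects you need can be sketched as follows. The helper expand is hypothetical, and the fetch argument stands for any function that takes an object path and returns the parsed JSON (in a real script, a thin wrapper around requests.get as in the GET example). The demonstration uses a stand-in dictionary with a placeholder title instead of the network.

```python
def expand(obj, prop, fetch):
    """Return a copy of obj with the links under prop replaced by objects.

    prop may hold a single path or a list of paths, as in frame=object output.
    fetch is any callable that maps an object path to its parsed JSON.
    """
    expanded = dict(obj)
    value = obj[prop]
    if isinstance(value, list):
        expanded[prop] = [fetch(path) for path in value]
    else:
        expanded[prop] = fetch(value)
    return expanded


# Offline demonstration: a dict stands in for the server (placeholder content)
fake_store = {
    "/labs/richard-myers/": {"@id": "/labs/richard-myers/", "title": "placeholder title"},
}

# A few properties from the biosample shown earlier, as frame=object returns them
biosample = {
    "accession": "ENCBS000AAA",
    "lab": "/labs/richard-myers/",
    "donor": "/human-donors/ENCDO000AAE/",
}

result = expand(biosample, "lab", fetch=fake_store.__getitem__)
print(result["lab"]["title"])   # the lab link is now a nested object
print(result["donor"])          # other links are left as identifiers
```

Only the property you ask for triggers extra requests, so the cost stays proportional to what the script actually uses.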