Introduction

Direct interaction with the ENCODE DCC metadata database is typically done through scripts you write and execute on your own Mac, PC or server. Your scripts interact with the database through an industry-standard, Hypertext-Transfer-Protocol-(HTTP)-based, Representational-state-transfer (RESTful) application programming interface (API). Data objects exchanged with the server conform to the standard JavaScript Object Notation (JSON) format. JSON is a defined data interchange format that is used web-wide. If you’re familiar with JavaScript’s object literal notation or Python’s dictionary datatype, then JSON-formatted text will look familiar. You will almost certainly use libraries for your language of choice to handle the network connection and parse the objects returned (requests and json for Python, for example). We have written example scripts you can look at here.


Quickstart examples

While getting objects from the database can often be done in just a few lines of code, here are some quick ways of testing your connection and exploring the ENCODE data objects without writing any code.

Prerequisites

  1. A JSON pretty-printer plugin for your web browser, such as JSONView (for Chrome or Firefox) or json-lite (for Safari)
  2. The curl command, which ships with Mac OSX and all major LINUX/UNIX implementations. If you don't have curl, an executable can be downloaded from here. Make sure you get the SSL and SSH aware version. You will find curl useful later for debugging your scripts (it always works), so if you don't already have it, you'll want to get it.

ENCODE JSON in your browser

Look at biosample ENCBS000AAA at https://www.encodeproject.org/biosamples/ENCBS000AAA/.

By clicking the JSON button, which looks like {;}, you should see the page viewed as JSON. Alternatively, you can manually add the following parameter to the URL: /?format=json

By adding the /?format=json parameter, you've instructed the ENCODE webserver to send you the raw contents of the database record for ENCBS000AAA, in JSON format. Rather than being rendered in your browser, you see the native JSON object as it is stored in the ENCODE database. This is a biosample object, and you will see all the properties and their values for this object. You need a JSON pretty printer (see prerequisites above) in order to make sense of the JSON. Notice that many of the object’s properties contain other (nested) objects.

Notice also that the object has its own URL, and that the URL corresponds to the object’s accession number. This always works to access ENCODE objects. In fact, because accession numbers are unique, you can construct a URL from the accession number alone and the server will return your object. For example, the resource https://www.encodeproject.org/ENCBS000AAA maps to the same object identified by the longer URL https://www.encodeproject.org/biosamples/ENCBS000AAA/

As you learn more about the ENCODE REST API, you’ll see that when you retrieve objects (using the HTTP “GET” method) from the server, they always come back as JSON.

A simple command-line GET

Try this in a terminal window:

$ curl -H "Accept: application/json" https://www.encodeproject.org/biosamples/ENCBS000AAA/

You should see lots of text that starts with something like "{"system_slims": [], "accession": "ENCBS000AAA", ...".

Your command instructs curl to download the database record for the ENCODE biosample ENCBS000AAA in JSON format (it's the same object rendered above and at https://www.encodeproject.org/biosamples/ENCBS000AAA/). In fact, the ENCODE server sees that the request is coming from something other than a web browser running the ENCODE web app, so it returns the object in JSON format. All curl is really doing is downloading the "web page" that corresponds to that ENCODE record. Transport is by HTTP, just like a web browser.

This sort of JSON-formatted text is sometimes referred to as a "JSON document". You can redirect the output to a file:

$ curl -L -H "Accept: application/json" https://www.encodeproject.org/biosamples/ENCBS000AAA/ > ENCBS000AAA.json
$ less ENCBS000AAA.json

In this example, the -L option tells curl to follow redirects. While not necessary in this example, -L is necessary for identifiers, like aliases, that get redirected to another resource. In cases like that, without -L you may see something like:

{"status": "error", "code": 301, "description": "Moved Permanently" ... }

If in doubt, use curl -L.

If you supply an invalid identifier (like an accession number that doesn't refer to anything), you'll see something like:

{"status": "error", "code": 404, "description": "The resource could not be found." ... }

You can see that even error responses come back in JSON format. This makes parsing the results of your queries much easier.

To pretty-print JSON on the command line, you can use the jq utility, available here. To use it with the output of curl, you can simply pipe its output to jq:

curl -H "Accept: application/json" https://www.encodeproject.org/biosamples/ENCBS000AAA/ | jq

REST, JSON and curl in perspective

As you develop your scripts, curl can be a valuable testing tool. If you can get an object with a curl command, you should be able to get it with your script.

The important thing to remember is that each ENCODE object is accessible via a simple URL containing an identifier (like an accession number) for that object. That's all curl does, that's all your browser does, and that's all your scripts will do. The resource you GET, using that URL, will be in JSON format. You will then use a library to parse the JSON and construct a native data structure (like a Python dictionary).

The ENCODE REST API

The ENCODE REST API uses GET to transport JSON-formatted information between the server and your scripts.

Prerequisites

  1. A library or module for your language of choice that supports HTTP. The requests library here for Python is good, and will be used for examples in this documentation.
  2. A library or module for parsing JSON-structured text and building native data structures. The Python json library here is good.

GET

The HTTP GET request is used to retrieve objects from the ENCODE server. GET will work without a username and password to fetch publicly-released ENCODE objects. Sending a GET request is accomplished here by using the requests.get() method.

import requests, json
 
# Force return from the server in JSON format
headers = {'accept': 'application/json'}
 
# This URL locates the ENCODE biosample with accession number ENCBS000AAA
url = 'https://www.encodeproject.org/biosample/ENCBS000AAA/?frame=object'
 
# GET the object
response = requests.get(url, headers=headers)
 
# Extract the JSON response as a Python dictionary
biosample = response.json()

# Print the Python object
print(json.dumps(biosample, indent=4))

This script GET's the JSON representation of ENCODE biosample ENCBS000AAA. The ENCODE server's JSON response is extracted as a Python dictionary using the response.json() method. The dictionary is then dumped back to a string and pretty-printed with useful indentation.

The output should look something like:

{
    "biosample_ontology": "/biosample-types/cell_line_EFO_0001203/",
    "description": "mammary gland, adenocarcinoma",
    "donor": "/human-donors/ENCDO000AAE/",
    "passage_number": 5,
    "lab": "/labs/richard-myers/",
    "accession": "ENCBS000AAA",
    "parent_of": [],
    "@type": [
        "Biosample",
        "Item"
    ],
    "url": "http://www.atcc.org/Products/All/HTB-22.aspx",
    "status": "released",
    "uuid": "56e94f2b-25ac-4c58-9828-f63b66220999"
    "notes": "(PMID: 4357757)",
    "award": "/awards/U54HG004576/",
    "age_display": "69 year",
    "product_id": "HTB-22",
    "culture_harvest_date": "2012-04-10",
    "applied_modifications": [],
    "dbxrefs": [
        "UCSC-ENCODE-cv:MCF-7"
    ],
    "@id": "/biosamples/ENCBS000AAA/",
    "culture_start_date": "2012-03-16",
    "life_stage": "adult",
    "organism": "/organisms/human/",
    "aliases": [
        "richard-myers:MCF7-003"
    ],
    "health_status": "breast cancer (adenocarcinoma)",
    "internal_tags": [],
    "references": [],
    "age_units": "year",
    "characterizations": [],
    "submitted_by": "/users/df9f3c8e-b819-4885-8f16-08f6ef0001e8/",
    "date_created": "2013-12-12T05:50:02.101495+00:00",
    "alternate_accessions": [],
    "source": "/sources/atcc/",
    "age": "69",
    "summary": "Homo sapiens MCF-7 cell line",
    "genetic_modifications": [],
    "schema_version": "24",
    "sex": "female",
    "treatments": [],
    "documents": [
        "/documents/984071d4-9149-476a-b353-93592c6f48f3/"
    ]
}

The important point is that the response object has been cast into a native Python dict datastructure. For example, documents is an array of document identifiers and so you can loop over that array.

Adding the following code to the example script above loops through the documents array, GET's each document object, and prints its description:

for doc_uri in biosample['documents']:
	doc_response = requests.get('https://www.encodeproject.org/' + doc_uri, headers=headers)
	document = doc_response.json()
	print document['description']

Which prints:

MCF-7 Cell Culture and 4-hydroxytamoxifen treatment

Programmatic search

In the GET example above, we use the ENCODE REST API to retrieve an individual object using its ENCODE accession number. You can also search for objects programmatically and get the search result back in JSON format. The ENCODE web app is a good place to start to see an example. Enter the string "bone chip" in the search box at the top right of the page at https://www.encodeproject.org/. The result are those ENCODE objects that match that string. Note that these results are by default not restricted to experiments, but include other object types as well. The object types of the results, among other properties, can be filtered using the facets to the left of the results.

Search URL format

The same URL that returns the search results to the web app can be used in a script. This script does the same search and returns the results in JSON format:

import requests, json

# Force return from the server in JSON format
headers = {'accept': 'application/json'}

# This searches the ENCODE database for the phrase "bone chip"
url = 'https://www.encodeproject.org/search/?searchTerm=bone+chip&frame=object'

# GET the search result
response = requests.get(url, headers=headers)

# Extract the JSON response as a python dictionary
search_results = response.json()

# Print the object
print(json.dumps(search_results, indent=4))

The structure of the response will be as follows. Keys with array values are denoted with [...] and keys with object values are denoted with {...}:

{
    "notification": "Success",
    "title": "Search",
    "@id": "/search/?searchTerm=bone+chip",
    "clear_filters": "/search/?searchTerm=bone+chip",
    "columns": {...},
    "all": "/?searchTerm=bone+chip&limit=all&format=json",
    "facets": [...],
    "@context": "/terms/",
    "@type": [...],
    "total": 180,
    "filters": [...],
    "@graph": [...]
}

The @graph property in the response script contains the array of search results, it should look something like this:

[
    {
        "authors": "Johnson ME, Deliard S, Zhu F, Xia Q, Wells AD, Hankenson KD, Grant SF",
        "identifiers": [
            "PMID:24337390"
        ],
        "title": "A ChIP-seq-defined genome-wide map of MEF2C binding reveals inflammatory pathways associated with its role in bone density determination.",
        "@id": "/publications/03e13868-6090-4f61-b4f1-b4a22a6e833a/",
        "issue": "4",
        "data_used": "ENCODE TF ChIP used throughout, Figures 1-2",
        "journal": "Calcified tissue international",
        "volume": "94",
        "date_created": "2014-08-13T01:26:28.846038+00:00",
        "@type": [
            "Publication",
            "Item"
        ],
        "date_published": "2014 Apr",
        ...
    },
    {
     "biosample_ontology": {
            "term_name": "bone marrow",
            "classification": "tissue"
        },
        "replicates": [
            {
                "antibody": {
                    "accession": "ENCAB000ADX"
                },
                "biological_replicate_number": 1,
                "technical_replicate_number": 1,
                "@id": "/replicates/8ca96774-c0d1-4f62-b7d3-f2988d406de2/",
                "library": {
                    "biosample": {
                        "age_units": "week",
                        "life_stage": "adult",
                        "organism": {
                            "scientific_name": "Mus musculus"
                        },
                        "accession": "ENCBS130ENC",
                        "age": "8"
                    }
                }
            },
            ...
        ]
    }, 
    ...
]

Whereas before, when we retrieved a single object, the @graph property was a list with only one element. Now @graph is a multi-element list, with each object that satisfies the search condition represented as one entry. Note that there are multiple types of objects returned by the query because we did not specify a type.

The objects returned contain only a subset of the properties. These are the properties that the server returns to the web app to render the search results page. To get the full objects back, add &frame=object to the query. The query URL looks like:

url = 'https://www.encodeproject.org/search/?searchTerm=bone+chip&frame=object&format=json'

This returns the full objects, with all their properties and values. Here is the @graph array in the response:

[
    {
        "data_used": "ENCODE TF ChIP used throughout, Figures 1-2",
        "categories": [
            "basic biology"
        ],
        "status": "released",
        "volume": "94",
        "@type": [
            "Publication",
            "Item"
        ],
        "uuid": "03e13868-6090-4f61-b4f1-b4a22a6e833a",
        "submitted_by": "/users/ce2bde01-07ec-4b8a-b179-554ef95b71dd/",
        "@id": "/publications/03e13868-6090-4f61-b4f1-b4a22a6e833a/",
        "page": "396-402",
        "issue": "4",
        "authors": "Johnson ME, Deliard S, Zhu F, Xia Q, Wells AD, Hankenson KD, Grant SF",
        "published_by": [
            "community"
        ],
        "date_created": "2014-08-13T01:26:28.846038+00:00",
        "supplementary_data": [],
        "identifiers": [
            "PMID:24337390"
        ],
        ...
    },
    {
        "biosample_ontology": "/biosample-types/tissue_UBERON_0002371/",
        "aliases": [],
        "status": "released",
        "files": [
            "/files/ENCFF001JXO/",
            "/files/ENCFF001JXP/",
            "/files/ENCFF001JXQ/",
            ...,
        ],
        "uuid": "2ed470f3-a57a-4ca4-83fe-d0bfa7022a36",
        "documents": [],
        "@id": "/experiments/ENCSR000CAG/",
        "description": "H3K4me1 ChIP-seq on 8-week mouse bone marrow",
        "replicates": [
            "/replicates/8ca96774-c0d1-4f62-b7d3-f2988d406de2/",
            "/replicates/0cb223fe-4893-4068-aaad-43b0398bc5da/"
        ],
        ...
    },
    ...
]

Embedded objects can be fully expanded in the search results with &frame=embedded so that the query would look like:

url = 'https://www.encodeproject.org/search/?searchTerm=bone+chip&frame=embedded&format=json'

The @graph of the response should start with something like:

[
    {
        "supplementary_data": [],
        "authors": "Johnson ME, Deliard S, Zhu F, Xia Q, Wells AD, Hankenson KD, Grant SF",
        "lab": "/labs/encode-consortium/",
        "volume": "94",
        "published_by": [
            "community"
        ],
        "data_used": "ENCODE TF ChIP used throughout, Figures 1-2",
        "identifiers": [
            "PMID:24337390"
        ],
        "categories": [
            "basic biology"
        ],
        "journal": "Calcified tissue international",
        "page": "396-402",
        "date_published": "2014 Apr",
        "award": "/awards/ENCODE/",
        "submitted_by": "/users/ce2bde01-07ec-4b8a-b179-554ef95b71dd/",
        "date_created": "2014-08-13T01:26:28.846038+00:00",
        "@type": [
            "Publication",
            "Item"
        ],
        ...
    },
    {
        "date_submitted": "2011-06-24",
        "superseded_by": [],
        "assay_term_name": "ChIP-seq",
        "lab": {
            "name": "bing-ren",
            "status": "current",
            "title": "Bing Ren, UCSD",
            "address2": "9500 Gilman Drive, Room 231",
            "@type": [
                "Lab",
                "Item"
            ],
            "state": "CA",
            "pi": "/users/ea5e2bd6-be7a-4b18-a2e8-dee86005422f/",
            ...,
        },
        ...
    },
    ...
]

Note, for example, the lab properties of results in the second query are objects, whereas in the previous search with frame=object they were simply paths to lab objects on the portal.

Additional search examples

Note: the last four examples here return a lot of data, so the format=json parameter has been omitted so as to avoid attempting to render it using an in-browser JSON formatter.

  1. Every object that matches the string “CTCF”:
    https://www.encodeproject.org/search/?searchTerm=CTCF&format=json&frame=object
  2. A file with a particular MD5 7b9f8ccd15fea0bda867e042db2b6f5a:
    https://www.encodeproject.org/search/?type=File&md5sum=7b9f8ccd15fea0bda867e042db2b6f5a&format=json
  3. All the file objects from a particular experiment ENCSR000AKS (with links to embedded objects):
    https://www.encodeproject.org/search/?type=File&dataset=/experiments/ENCSR000AKS/&format=json&frame=object
  4. All the fastq files from a particular experiment ENCSR000AKS:
    https://www.encodeproject.org/search/?type=File&dataset=/experiments/ENCSR000AKS/&file_format=fastq&format=json&frame=object
  5. All the replicates for a particular experiment ENCSR000AKS:
    https://www.encodeproject.org/search/?type=Replicate&experiment.accession=ENCSR000AKS&format=json
  6. The same query as above but with embedded objects expanded:
    https://www.encodeproject.org/search/?type=Replicate&experiment.accession=ENCSR000AKS&format=json&frame=embedded
  7. All biosamples (abbreviated metadata):
    https://www.encodeproject.org/search/?type=Biosample
  8. All biosamples (full metadata with object references):
    https://www.encodeproject.org/search/?type=Biosample&frame=object
  9. All experiments that are not ChIP-seq (note the use of != in the query):
    https://www.encodeproject.org/search/?type=Experiment&assay_title!=ChIP-seq&frame=object
  10. All treated biosamples (note the use of * in the query):
    https://www.encodeproject.org/search/?type=Biosample&treatments=*&frame=object

Summary of search features

  • If you do not specify frame=something, you will get a subset of object properties as defined by the webapp. If what you need is in that subset (like accession, maybe), it’s a relatively fast way of getting back a long list of objects.
  • If you specify frame=embedded you will get all the object properties, with selected embedded objects expanded.
  • frame=object will always give you all of the properties, with embedded objects referred to by an identifier. So it’s consistent, and it’s faster. Which embedded objects are expanded by frame=embedded may change in the future as our requirements for search change. Also, the depth of expansion is chosen to support search and might change as well. So, for the most robust code you might choose to use frame=object and then GET only the embedded objects you need.
  • You do not need to specify frame=embedded in order to search within embedded objects. For example, the query paramater lab.state=CA would work when searching with frame=object
  • All of the results from search can be visualized within the web app by opening the URL in your browser, omitting format=json. Not all objects have fancy rendered collections or pages, but you will always get back a clickable list.
  • limit=all is not strictly necessary but without it your result will contain only the first 25 hits (or all if < 25 hits).
  • limit=all should always be used in conjunction with format=json.
  • We plan to implement a mechanism to return very large searches in batches, but for now limit=all can generate large result sets.
  • Giving searchTerm=somestring will often return multiple object types. You can look into the @type property of each returned object for its data type.
  • To make a search specifying that a property is NOT equal to a value (negation), use the != operator (see example 9 above)
  • The * character can be used as a wildcard, matching any value. This is useful when you would like to check the existence or non-existence of a property. Specifing property=* will return objects for which the property is both present and non-empty. It can be used with != as well to search for objects where a property is missing or empty.

Summary of syntax features

Parameter or syntax feature Example Description
frame=object https://www.encodeproject.org/biosample/ENCBS000AAA/?frame=object Returns all properties of the object, with any embedded objects referred to by an identifier. Note that if there is no "frame" parameter in the query, a subset of object properties as defined by the webapp, which is subject to change.
frame=embedded https://www.encodeproject.org/biosample/ENCBS000AAA/?frame=embedded Returns all properties of the object, with selected embedded objects expanded, subject to change.
limit

https://www.encodeproject.org/search/?type=Biosample&limit=all&format=json

https://www.encodeproject.org/search/?type=Biosample&limit=20

Returns up to the specified number of results, or returns all results in the case of limit=all .
searchTerm https://www.encodeproject.org/search/?type=Experiment&searchTerm=CTCF Returns results matching the specified search string.
lt https://www.encodeproject.org/search/?type=File&quality_metrics.frip=lt:0.65 Range syntax for "less than" for use with numeric properties.
lte https://www.encodeproject.org/search/?type=File&quality_metrics.frip=lte:0.65 Range syntax for "less than or equal" for use with numeric properties.
gt https://www.encodeproject.org/search/?type=File&file_size=gt:1000000 Range syntax for "greater than" for use with numeric properties.
gte https://www.encodeproject.org/search/?type=File&file_size=gte:1000000 Range syntax for "greater than or equal" for use with numeric properties.
* https://www.encodeproject.org/search/?type=Biosample&treatments=* Wildcard character, which can be used to check the existence/non-existence of a property.
!= https://www.encodeproject.org/search/?type=Biosample&treatments!=* Negation. Query results will exclude results matching the specified value.