File Submission

Prerequisites

Metadata

Before you submit data files to the DCC, you should have already registered the following metadata with the DCC and received accession numbers for them (where they are relevant to your assay):

  • Experiment accession [required for file submission]
  • Replicates [required for submission of raw data files]
  • Biosamples and donors
  • Antibodies and validations

Data files will not be released until this metadata has been reviewed for accuracy and consistency.

File validation software

Download

The software package needed to validate your files is compiled specifically for Linux machines. Make sure your server environment is compatible before downloading the package.

The contents of the package include:

  • validateManifest, a pre-compiled utility [this is no longer current or updated but provided in case labs wish to use it]
  • validateFiles, a pre-compiled utility
  • encValData/, a folder of files needed to run both utilities

The current software (version 1.9) can be downloaded: Encode3 Validation Package

Or via wget:

> wget http://hgwdev.cse.ucsc.edu/~galt/encode3/validatePackage/validateEncode3-latest.tgz

Untar the package using:

> tar -xvf [package]

Installation

You will need to either unpack the software in the same directory as your file submission scripts or add the location of the software to your PATH.
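As a quick check that either option worked, a submission script can look for the utility before doing anything else. This is a minimal sketch; `find_tool` is a hypothetical helper name, not part of the validation package.

```python
import os
import shutil


def find_tool(name, script_dir="."):
    """Return a path to an executable called `name`, or None if not found.

    Looks in `script_dir` first (the "same directory" option), then falls
    back to the directories listed in PATH.
    """
    local = os.path.join(script_dir, name)
    if os.path.isfile(local) and os.access(local, os.X_OK):
        return os.path.abspath(local)
    return shutil.which(name)
```

Calling `find_tool("validateFiles")` at the top of a script and stopping when it returns None gives a clearer error than a failed command later on.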

AWS CLI

You will need to install the Amazon command line interface tool: http://aws.amazon.com/cli/

File submission

Introductory help docs

Check the file format and calculate md5sum

Before you submit the file, please check the file format using validateFiles (see installation instructions). Also, please calculate the md5sum. The example script listed above describes how it can be bundled with the file object submission and file submission steps.
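The md5sum can be calculated with the standard `md5sum` command or inside a submission script. The following is a minimal Python sketch; the function name `md5sum` and the chunk size are illustrative choices, not something the DCC prescribes.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in chunks so that large fastq or BAM files do not
    have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string should match the output of the command-line `md5sum` tool for the same file.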

Submit a file object

Submit a file object with the following required metadata:

As with other objects, you can submit an alias for the file object, that is, your own locally unique name for that file. Please note that a JSON pretty-printer plugin makes JSON easier to read in your browser.

Specific for fastq files

The following properties are required for fastq files:

  • replicate [link to the UUID of the correct replicate or the alias]
  • paired_end (1 or 2, if the files are paired-end)
  • paired_with (if the file is a fastq, csfasta, or csqual file_format and the paired_end property = 2, the accession or alias of the file that is paired_end = 1 must be included in the file object)

Submitting these properties is highly encouraged, since they help ensure that the right file is linked to the right replicate:

  • platform [link to the platform used to generate the fastq file]
  • flowcell_details [includes the machine details and flowcell information]

Additionally, fastq file objects should not include the assembly.
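The fastq rules above can be sketched as a small helper that builds the fastq-specific part of a file object. This is an illustration only: `fastq_file_object` is a hypothetical name, the alias values in the usage note are made up, and the exact JSON type expected for paired_end (number or string) is an assumption to check against the schema. The other required metadata for a file object is not covered here.

```python
def fastq_file_object(replicate, paired_end=None, paired_with=None, platform=None):
    """Build a dict holding the fastq-specific properties described above.

    Only property names mentioned in this document are used, and no
    assembly is ever added.
    """
    obj = {"file_format": "fastq", "replicate": replicate}
    if paired_end is not None:
        if paired_end not in (1, 2):
            raise ValueError("paired_end must be 1 or 2")
        obj["paired_end"] = paired_end
    if paired_end == 2:
        # The second read of a pair must point at the paired_end = 1 file.
        if not paired_with:
            raise ValueError("paired_end = 2 requires paired_with")
        obj["paired_with"] = paired_with
    if platform is not None:
        obj["platform"] = platform
    return obj
```

For example, `fastq_file_object("my-lab:rep1", paired_end=2, paired_with="my-lab:reads_1")` returns a dict containing paired_with, while omitting paired_with for a paired_end = 2 file raises an error before anything is sent.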

For processed files

The following properties are required for processed files:

Submitting these file relationship properties is highly encouraged, since this information will be used to generate the schematic of file relationships:

  • derived_from [the list of files that served as input to generate the file you are submitting]
  • controlled_by [the list of files that served as controls]
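As an illustration, the two relationship properties can be attached to an existing file object as lists of accessions or aliases. `with_relationships` is a hypothetical helper name; the required properties for processed files are not shown here.

```python
def with_relationships(file_object, derived_from=(), controlled_by=()):
    """Return a copy of `file_object` with relationship lists attached.

    Empty relationships are left out rather than sent as empty lists.
    """
    result = dict(file_object)
    if derived_from:
        result["derived_from"] = list(derived_from)
    if controlled_by:
        result["controlled_by"] = list(controlled_by)
    return result
```

The original dict is not modified, so the same base object can be reused for several files.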

Submit the file

ENCODE cloud credentials

Once you submit the file object, the JSON response will include the credentials that allow the file to be submitted. Here is an example snippet of the response:

{
    "@id": "/files/TSTFF442764/", 
    "@type": [
        "file", 
        "item"
    ], 
    "accession": "TSTFF442764", 
    "upload_credentials": {
        "access_key": "ASIAJ2H5Y4GZGL2TRTZQ", 
        "expiration": "2015-01-09T10:45:53Z", 
        "federated_user_arn": "arn:aws:sts::618537831167:federated-user/upload-1420757153.21-TSTFF442764", 
        "federated_user_id": "618537831167:upload-1420757153.21-TSTFF442764", 
        "request_id": "1a0614f8-9788-11e4-a964-bd2a3d078910", 
        "secret_key": "I66DAE8icPt6teAGSmFgBJRoT6cVGQMxUUDeTQjU", 
        "session_token": "AQoDYXdzEDgaoAMIBjQhrS7u9v9eeqB466oOSZX3lujjFFTXQ5/R0rSAMnKxBPWMqpZuT4wo2qfYCLn4Db1Kq4Z/ff9rsfEcc8rf4LAAUM5b9vSrQ0v4tuZ49XeZ1dIxAX+wDFnDZRmLRTY6G2XvzA55n4EgVN5Gt1NLCdhYn821/Sgnm4fcoDH7VvU/OB3XChTpDGIER62iWoEj9sNg1/Sqd0pkrm8iLNBwQZbWmBWgW40kE/3E413s2LeIUTgv8rYqyM4N79X2UnhHOgJ6VBLcFyfi02TaziicULC0erkTuZHrvjOgHTvWudnPrrH4A6aRuYo+WlJ1c15Df40UanSidQ8i03dSd8Ib6AJcvl8mqBcdGqt2Uoc09usSsO7FIZN47iQqDnonnQIpxaRhGbSFUrnwKG1PqsFdsEBLtYKgsXuqeH1swC+0IUlyVDbyUdVaTTjw8LoBd2Tp4wZVdnso4HtKD7EYnXP+trU89xmPyTNBm7oAnooD26mBmq1F8Y/8bLsm1LFHpKK+RmJ0QoDKAYTN+B5f8Yi7FPxGVZWBqKptlzmdTP5xCyChkbylBQ==", 
        "upload_url": "s3://encoded-files-dev/2015/01/08/b43d0f62-82e0-44bf-b8dd-cf89bec1b3bf/TSTFF442764.fastq.gz"
    }, 
    "uuid": "b43d0f62-82e0-44bf-b8dd-cf89bec1b3bf"
}

The submission script example above shows how this response is used to submit the file to the ENCODE cloud.
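One way to use the response, sketched below, is to pass the temporary credentials to the AWS CLI through its standard environment variables and copy the file to the upload_url with `aws s3 cp`. The helper names (`aws_environment`, `upload_command`, `upload_file`) are hypothetical, and this is not necessarily identical to the example script.

```python
import os
import subprocess


def aws_environment(upload_credentials, base_env=None):
    """Return a copy of the environment with the temporary credentials set."""
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_ACCESS_KEY_ID"] = upload_credentials["access_key"]
    env["AWS_SECRET_ACCESS_KEY"] = upload_credentials["secret_key"]
    env["AWS_SESSION_TOKEN"] = upload_credentials["session_token"]
    return env


def upload_command(local_path, upload_credentials):
    """Return the AWS CLI command that copies the file to its upload_url."""
    return ["aws", "s3", "cp", local_path, upload_credentials["upload_url"]]


def upload_file(local_path, upload_credentials):
    """Run the copy; raises CalledProcessError if the AWS CLI fails."""
    subprocess.run(
        upload_command(local_path, upload_credentials),
        env=aws_environment(upload_credentials),
        check=True,
    )
```

Here `upload_credentials` is the dict found under the "upload_credentials" key of the response. Setting the variables only for the child process keeps the temporary credentials out of your shell profile.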

Managing credentials

The credentials are valid for 36 hours. You can manage the credentials by appending "/upload" to the file URL:

https://www.encodeproject.org/<ENCFF accession>/upload

Retrieving existing credentials

If it is still within 36 hours of the file object being posted, you can retrieve the existing credentials using GET.
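To decide whether existing credentials are still usable, a script can compare the current time with the "expiration" value returned in upload_credentials instead of tracking the posting time itself. A minimal sketch, assuming the timestamp always has the UTC format shown in the example response; `credentials_expired` is a hypothetical name.

```python
from datetime import datetime, timezone


def credentials_expired(expiration, now=None):
    """Return True if the `expiration` timestamp has already passed.

    `expiration` is a string such as "2015-01-09T10:45:53Z"; `now` defaults
    to the current UTC time and exists mainly to make testing easy.
    """
    expires_at = datetime.strptime(expiration, "%Y-%m-%dT%H:%M:%SZ")
    expires_at = expires_at.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires_at
```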

Renewing credentials

After 36 hours, a new set of credentials will need to be issued. You can create new credentials with a POST to that URL, passing in an empty object.
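The two requests can be sketched with the Python standard library as follows. The helper names are hypothetical, and authenticating with HTTP Basic auth using your DCC access key pair (`auth_id`, `auth_secret`) is an assumption; use whatever authentication your other submission scripts already use.

```python
import base64
import json
import urllib.request


def upload_url_for(file_url):
    """Append "/upload" to a file URL, tolerating a trailing slash."""
    return file_url.rstrip("/") + "/upload"


def credentials_request(file_url, auth_id, auth_secret, renew=False):
    """Build a GET (retrieve) or POST (renew) request for upload credentials."""
    token = base64.b64encode(f"{auth_id}:{auth_secret}".encode()).decode()
    headers = {"Accept": "application/json", "Authorization": "Basic " + token}
    data = None
    method = "GET"
    if renew:
        # Renewal is a POST whose body is an empty JSON object.
        data = json.dumps({}).encode("utf-8")
        headers["Content-Type"] = "application/json"
        method = "POST"
    return urllib.request.Request(
        upload_url_for(file_url), data=data, headers=headers, method=method
    )
```

The request is only built here; passing it to `urllib.request.urlopen` and decoding the body with `json.load` would send it and return the server's JSON response.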

Advice about Workflow

  • Use https://test.encodedcc.org/ to test out the scripts - you will get TST accessions and these files will be in our dev bucket.
  • The "paired_with" and "derived_from" properties require that the file you refer to already exists when you submit this information. Submitting the files in the following order ensures that the links are present. This example is for a set of fastqs, BAMs, and bigWigs.
    1. Submit the fastqs that are "paired_end = 1". In the response, you will get an ENCFF number for each file.
    2. Submit the fastqs that are "paired_end = 2". You must refer to the paired_end = 1 file in the "paired_with" property. You can refer to the already submitted file by ENCFF, alias, or md5sum. You will get an ENCFF number for each file.
    3. Submit the BAMs, including the 2 fastqs in the "derived_from" property. You can refer to the already submitted files by ENCFF, alias, or md5sum. You will get an ENCFF number for each BAM.
    4. Submit the bigWigs, referring to the BAM in the "derived_from" property. You can refer to the already submitted file by ENCFF, alias, or md5sum. You will get an ENCFF number for each bigWig.
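The ordering described above can also be computed rather than maintained by hand. The sketch below assumes each file object is keyed by a local alias and refers to other files in the same batch by alias; `submission_order` is a hypothetical helper, the aliases in the usage note are made up, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def submission_order(files):
    """Return aliases ordered so that every referenced file comes first.

    `files` maps a local alias to its file object (a dict). References to
    files outside the batch are ignored, on the assumption that they have
    already been submitted.
    """
    graph = {}
    for alias, obj in files.items():
        deps = set(obj.get("derived_from", []))
        if obj.get("paired_with"):
            deps.add(obj["paired_with"])
        graph[alias] = {dep for dep in deps if dep in files}
    return list(TopologicalSorter(graph).static_order())
```

For a batch holding a bigWig derived from a BAM, the BAM derived from two fastqs, and the second fastq paired with the first, the result lists the paired_end = 1 fastq, then the paired_end = 2 fastq, then the BAM, then the bigWig. A circular reference raises `graphlib.CycleError` instead of producing a submission that cannot succeed.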