File Submission
Prerequisites
Metadata
Before you submit data files to the DCC, you should already have registered the following metadata with the DCC and received accession numbers for it (where relevant to your assay):
- Experiment accession [required for file submission]
- Replicates [required for submission of raw data files]
- Biosamples and donors
- Antibodies and validations
Data files will not be released until this metadata has been reviewed for accuracy and consistency.
File validation software
Download
The software package needed to validate your files is compiled specifically for Linux machines. Make sure to check your server environment before downloading the package.
The contents of the package include:
- validateManifest, a pre-compiled utility [this is no longer current or updated but provided in case labs wish to use it]
- validateFiles, a pre-compiled utility
- encValData/, a folder of files needed to run both utilities
The current software (version 1.9) can be downloaded: Encode3 Validation Package
Alternatively, download it with wget:
> wget http://hgwdev.cse.ucsc.edu/~galt/encode3/validatePackage/validateEncode3-latest.tgz
Untar the package using:
> tar -xvf [package]
Installation
You will need to either unpack the software in the same directory as your file submission scripts or add the location of the software to your PATH.
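One way to confirm that the installation step worked is to check that the utility can actually be found. The sketch below is not part of the validation package; the helper name find_validator is hypothetical, and it relies only on Python's standard library.

```python
import os
import shutil


def find_validator(name="validateFiles", extra_dir=None):
    """Return the full path to the utility, or None if it cannot be found.

    Looks in extra_dir first (for example, the directory that holds your
    submission scripts), then falls back to the directories on PATH.
    """
    if extra_dir is not None:
        candidate = os.path.join(extra_dir, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return shutil.which(name)


if __name__ == "__main__":
    location = find_validator(extra_dir=os.getcwd())
    if location is None:
        print("validateFiles not found; unpack it here or add it to PATH")
    else:
        print("validateFiles found at", location)
```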
AWS CLI
You will need to install the Amazon command line interface tool: http://aws.amazon.com/cli/
File submission
Introductory help docs
- A short presentation that our lead programmer gave at the ENCODE consortium meeting about the file submission process is available: https://docs.google.com/presentation/d/1a5q9u5rGF0b0rhEmJQ1KG2fNi5wsJI1I7b-UZ_RX8ho/edit#slide=id.p
- The example script we provide to our submitters wraps up the validation, md5sum, and file_size calculation steps, then posts a file object, retrieves the authentication tokens, and submits the file via the awscli utility: https://github.com/ENCODE-DCC/encoded/blob/master/examples/submit_file.py
Check the file format and calculate md5sum
Before you submit the file, please check the file format using validateFiles (see the installation instructions above) and calculate the md5sum. The example script listed above shows how these steps can be bundled with the file object submission and file submission steps.
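As an illustration of the checksum step, here is a minimal sketch that computes the md5sum and file size with Python's standard library. The function name md5_and_size is hypothetical. Running validateFiles is a separate step and is not shown here, because its options depend on the file format.

```python
import hashlib
import os


def md5_and_size(path, chunk_size=1024 * 1024):
    """Return (md5 hex digest, size in bytes) for the file at path.

    The file is read in chunks so that large fastq or BAM files do not
    need to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)
```

The hex digest is the value that would go into the md5sum property of the file object.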
Submit a file object
Submit a file object with the following required metadata:
- dataset [the experiment accession]
- file_format [see the full list of valid enums at https://www.encodeproject.org/profiles/file.json]
- output_type [see the full list of valid enums at https://www.encodeproject.org/profiles/file.json]
- md5sum [calculate it locally; it is used to check that we received the entire file]
- submitted_file_name [the local path to your file]
- award
- lab
As with other objects, you can submit an alias for the file object, which is your local unique name for that file. Note that a JSON pretty-printer plugin makes JSON easier to read in your browser.
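To make the list of required metadata concrete, the sketch below assembles a file object as a Python dictionary and refuses to build it if a required property is missing. The helper name build_file_object and every value in the example (accession, award, lab, alias, path) are placeholders invented for illustration; check the file profile for the valid file_format and output_type values.

```python
import json

# The required properties listed above.
REQUIRED_FILE_FIELDS = (
    "dataset",
    "file_format",
    "output_type",
    "md5sum",
    "submitted_file_name",
    "award",
    "lab",
)


def build_file_object(**properties):
    """Return a dict for a file object, checking the required properties."""
    missing = [name for name in REQUIRED_FILE_FIELDS if not properties.get(name)]
    if missing:
        raise ValueError("missing required properties: " + ", ".join(missing))
    return dict(properties)


if __name__ == "__main__":
    # All values below are placeholders, not real ENCODE identifiers.
    file_object = build_file_object(
        dataset="ENCSR000AAA",
        file_format="fastq",
        output_type="reads",
        md5sum="5d41402abc4b2a76b9719d911017c592",
        submitted_file_name="/data/example.fastq.gz",
        award="example-award",
        lab="example-lab",
        aliases=["example-lab:file-001"],
    )
    print(json.dumps(file_object, indent=2, sort_keys=True))
```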
Specific for fastq files
The following properties are required for fastq files:
- replicate [link to the UUID of the correct replicate or the alias]
- paired_end (1 or 2, if the files are paired-end)
- paired_with (if the file is a fastq, csfasta, or csqual file_format and the paired_end property = 2, the accession or alias of the file that is paired_end = 1 must be included in the file object)
Submitting the following properties is highly encouraged, since they help make sure that the right file is linked to the right replicate:
- platform [link to the platform used to generate the fastq file]
- flowcell_details [includes the machine details and flowcell information]
Additionally, fastq files should not include the assembly property.
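The fastq rules above can be checked locally before posting. This is a rough sketch of such a check, not the validation the DCC itself performs; the function name check_fastq_object is hypothetical, and it treats paired_end as optional for single-ended files.

```python
def check_fastq_object(file_object):
    """Return a list of problems found in a fastq file object (empty if none)."""
    problems = []
    if not file_object.get("replicate"):
        problems.append("replicate is required for fastq files")
    paired_end = file_object.get("paired_end")
    if paired_end is not None and str(paired_end) not in ("1", "2"):
        problems.append("paired_end must be 1 or 2")
    if str(paired_end) == "2" and not file_object.get("paired_with"):
        problems.append("paired_with is required when paired_end is 2")
    if "assembly" in file_object:
        problems.append("fastq files should not include assembly")
    return problems
```

An empty list means that none of the rules listed in this section were violated; it says nothing about the other required metadata.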
For processed files
The following properties are required for processed files:
- assembly [see the full list of valid enums at https://www.encodeproject.org/profiles/file.json]
Submitting the following file relationship properties is highly encouraged, since this information will be used to generate the schematic of file relationships:
- derived_from [the list of files that served as input to generate the file you are submitting]
- controlled_by [the list of files that served as controls]
Submit the file
ENCODE cloud credentials
Once you submit the file object, the JSON response will include the credentials that allow the file to be submitted. Here is an example snippet of the response:
{
  "@id": "/files/TSTFF442764/",
  "@type": [ "file", "item" ],
  "accession": "TSTFF442764",
  "upload_credentials": {
    "access_key": "ASIAJ2H5Y4GZGL2TRTZQ",
    "expiration": "2015-01-09T10:45:53Z",
    "federated_user_arn": "arn:aws:sts::618537831167:federated-user/upload-1420757153.21-TSTFF442764",
    "federated_user_id": "618537831167:upload-1420757153.21-TSTFF442764",
    "request_id": "1a0614f8-9788-11e4-a964-bd2a3d078910",
    "secret_key": "I66DAE8icPt6teAGSmFgBJRoT6cVGQMxUUDeTQjU",
    "session_token": "AQoDYXdzEDgaoAMIBjQhrS7u9v9eeqB466oOSZX3lujjFFTXQ5/R0rSAMnKxBPWMqpZuT4wo2qfYCLn4Db1Kq4Z/ff9rsfEcc8rf4LAAUM5b9vSrQ0v4tuZ49XeZ1dIxAX+wDFnDZRmLRTY6G2XvzA55n4EgVN5Gt1NLCdhYn821/Sgnm4fcoDH7VvU/OB3XChTpDGIER62iWoEj9sNg1/Sqd0pkrm8iLNBwQZbWmBWgW40kE/3E413s2LeIUTgv8rYqyM4N79X2UnhHOgJ6VBLcFyfi02TaziicULC0erkTuZHrvjOgHTvWudnPrrH4A6aRuYo+WlJ1c15Df40UanSidQ8i03dSd8Ib6AJcvl8mqBcdGqt2Uoc09usSsO7FIZN47iQqDnonnQIpxaRhGbSFUrnwKG1PqsFdsEBLtYKgsXuqeH1swC+0IUlyVDbyUdVaTTjw8LoBd2Tp4wZVdnso4HtKD7EYnXP+trU89xmPyTNBm7oAnooD26mBmq1F8Y/8bLsm1LFHpKK+RmJ0QoDKAYTN+B5f8Yi7FPxGVZWBqKptlzmdTP5xCyChkbylBQ==",
    "upload_url": "s3://encoded-files-dev/2015/01/08/b43d0f62-82e0-44bf-b8dd-cf89bec1b3bf/TSTFF442764.fastq.gz"
  },
  "uuid": "b43d0f62-82e0-44bf-b8dd-cf89bec1b3bf"
}
The example submission script listed above shows how this response is used to submit the file to the ENCODE cloud.
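For orientation, here is a minimal sketch of how the upload_credentials block could be handed to the AWS command line tool. It assumes that access_key, secret_key, and session_token map onto the standard AWS environment variables and that the file is copied to upload_url with "aws s3 cp"; the function names are hypothetical, and the example script remains the reference.

```python
import os
import subprocess


def build_upload_command(local_path, upload_credentials, base_env=None):
    """Return (argv, env) for copying local_path to upload_url with the AWS CLI."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumed mapping from the response fields to AWS environment variables.
    env.update({
        "AWS_ACCESS_KEY_ID": upload_credentials["access_key"],
        "AWS_SECRET_ACCESS_KEY": upload_credentials["secret_key"],
        "AWS_SESSION_TOKEN": upload_credentials["session_token"],
    })
    argv = ["aws", "s3", "cp", local_path, upload_credentials["upload_url"]]
    return argv, env


def upload_file(local_path, upload_credentials):
    """Run the AWS CLI; raises CalledProcessError if the copy fails."""
    argv, env = build_upload_command(local_path, upload_credentials)
    subprocess.check_call(argv, env=env)
```

Passing the credentials through the environment of the child process keeps them out of the command line and out of any shared AWS configuration file.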
Managing credentials
The credentials are good for 36 hours. You can manage the credentials by appending "/upload" to the file URL:
https://www.encodeproject.org/<ENCFF accession>/upload
Retrieving existing credentials
If it is still within 36 hours of the file object being posted, you can retrieve the existing credentials using GET.
Renewing credentials
After 36 hours, a new set of credentials will need to be issued. You can create new credentials with a POST to that URL, passing in an empty object.
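The two operations above differ only in the HTTP method and body. The sketch below builds either request with Python's standard library. It assumes that the server accepts HTTP basic authentication with your access key pair, which this page does not state; the function names are hypothetical.

```python
import base64
import json
import urllib.request


def upload_url_for(server, accession):
    """Append "/upload" to the file URL, as described above."""
    return server.rstrip("/") + "/" + accession.strip("/") + "/upload"


def credentials_request(server, accession, key_id, key_secret, renew=False):
    """Build a GET (retrieve) or POST (renew) request for upload credentials."""
    # Assumption: basic authentication with the submitter's access key pair.
    token = base64.b64encode(f"{key_id}:{key_secret}".encode()).decode()
    headers = {"Accept": "application/json", "Authorization": "Basic " + token}
    data = None
    if renew:
        data = json.dumps({}).encode()  # POST an empty object
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        upload_url_for(server, accession),
        data=data,
        headers=headers,
        method="POST" if renew else "GET",
    )


def fetch_credentials(request):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, credentials_request("https://test.encodedcc.org", "TSTFF442764", key_id, key_secret, renew=True) would build the POST for the test accession shown earlier.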
Advice about Workflow
- Use https://test.encodedcc.org/ to test your scripts. You will get TST accessions, and these files will be stored in our dev bucket.
- The "paired_with" and "derived_from" properties require that the file you refer to already exists. Submitting the files in the following order ensures that the links are present. This example is for a set of fastqs, BAMs, and bigWigs:
  - Submit the fastqs that are "paired_end = 1". In the response, you will get an ENCFF number for each file.
  - Submit the fastqs that are "paired_end = 2". You must refer to the paired_end = 1 file in the "paired_with" property. You can refer to the already submitted file by ENCFF, alias, or md5sum. You will get an ENCFF number for each file.
  - Submit the BAMs, including the 2 fastqs in the "derived_from" property. You can refer to the already submitted files by ENCFF, alias, or md5sum. You will get an ENCFF number for each file.
  - Submit the bigWigs, referring to the BAM in the "derived_from" property. You can refer to the already submitted file by ENCFF, alias, or md5sum. You will get an ENCFF number for each bigWig.
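The ordering above amounts to submitting each file only after everything it refers to. As a sketch, the order can be computed from the file objects themselves with a topological sort. The function name submission_order and the local names in the example are hypothetical; graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def submission_order(files):
    """Order files so that every referenced file is submitted first.

    files maps a local name to a file object (dict). Only references to
    other names in the same mapping are treated as dependencies.
    """
    sorter = TopologicalSorter()
    for name, file_object in files.items():
        references = list(file_object.get("derived_from", []))
        if file_object.get("paired_with"):
            references.append(file_object["paired_with"])
        sorter.add(name, *[ref for ref in references if ref in files])
    return list(sorter.static_order())


if __name__ == "__main__":
    example = {
        "signal.bigWig": {"derived_from": ["aligned.bam"]},
        "aligned.bam": {"derived_from": ["reads_1.fastq", "reads_2.fastq"]},
        "reads_2.fastq": {"paired_end": "2", "paired_with": "reads_1.fastq"},
        "reads_1.fastq": {"paired_end": "1"},
    }
    for name in submission_order(example):
        print(name)
```

A circular reference raises graphlib.CycleError, which is a useful early warning that the file relationships are inconsistent.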