Data Use, Software, and Analysis Release Policies

The goal of the Encyclopedia of DNA Elements (ENCODE) Project is to build a comprehensive catalog of candidate functional elements in the genome. The catalog includes genes (protein-coding and non-protein coding), transcribed regions, and regulatory elements, as well as information about the tissues, cell types and conditions where they are found to be active. The current phase of ENCODE (2016-2019) greatly expands the number of cell types, data types and assays and includes the study of both the human and mouse genomes.

Like the Human Genome Project, the ENCODE Project seeks rapid data dissemination and use by the entire scientific community. Accordingly, to encourage the widest possible use of the datasets, all data produced will be available for unrestricted use immediately upon release to public databases, eliminating the nine-month moratorium used in earlier phases of ENCODE. For guidelines on how to cite ENCODE data in your publications, please visit the Citing ENCODE page.

 

Software and Analysis Release Policy for ENCODE Consortium Members

Overview

The ENCODE Project aims to enhance biomedical research by generating community resources of genomics data, software, tools and methods for genomics data analysis, and products resulting from data analyses and interpretations.

As outlined in the ENCODE FOAs, NHGRI expects that the major resources generated by the ENCODE awards, including data, software, and analyses, will be made freely available to the research community. Timely release of software and analyses is a central part of ENCODE’s mission: it helps maximize the value of our work to the broader community and enhances the reproducibility of our research.

We hope that the tools, methods, and standards developed by ENCODE will have a lasting impact on the field. In particular, the software produced by ENCODE is likely to be an important legacy of the project. Software packages are often important community contributions, and may be extremely widely used and highly cited. Thus we strongly encourage ENCODE-funded labs to provide usable, documented software and associated publications, for the mutual benefit of the publishing labs and the consortium.

This document outlines policy for software and analysis release for ENCODE-funded research by all members of the ENCODE Consortium.

Software

Developers of significant new ENCODE-related software will make their programs, including source code, freely available. Examples include data processing pipelines and implementations of statistical, visualization, and modeling tools developed to process or analyze ENCODE data.

  • What to release: ENCODE requires the release of analysis pipelines used for major ENCODE products such as the ENCODE Encyclopedia. ENCODE strongly encourages release of software tools and pipelines used for major analyses in planned ENCODE publications, and other software likely to be useful to multiple groups either within ENCODE or in the broader community.
  • When to release software: The decision about when to release software should balance the benefit to the community against the labor involved in software release and maintenance. Major software tools should be released as soon as they are sufficiently stable, and no later than the time they are first used in publications.
  • How to release software: Software should be released through version-controlled public repositories (e.g. GitHub). These repositories should be linked via the ENCODE DCC. Software should be well-documented and there should be a contact person for questions. Software development should continue through version-controlled deposition, and the software version used to generate each dataset should be documented.
  • Accompanying publications: In addition to the release of well-documented code, we strongly encourage developers to publish citable descriptions of their software. We recommend that authors describe their software in methodological papers so that they can receive credit for their work. These can be published in conventional journals, and/or disseminated pre-publication through pre-print servers (e.g. bioRxiv). 
  • Dissemination of more complex pipelines: For most complex analyses, multiple software components are routinely combined to generate intermediate datasets. For reproducibility of these results, analysts should document all software components used, and the specific software versions utilized. We encourage (1) documenting these components; (2) providing scripts that reproduce key figures in ENCODE publications; (3) establishing reusable, publicly accessible analysis pipelines (e.g. Galaxy, virtual machines, Docker, DNAnexus sessions); and (4) linking these through the DCC website.
  • Current and Future Support for Released Software: Software released for publication and to repositories (e.g., GitHub) should state the type and degree of support users can expect when downloading, running, and troubleshooting it, as well as whether updates and “fixes” should be expected.
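As an illustration of documenting the software version used to generate each dataset, the following is a minimal sketch in Python. It is not an ENCODE or DCC tool: the function name build_provenance, the record fields, and the example pipeline name and version are all hypothetical, chosen only to show one way a lab might record this information alongside an output file.

```python
import hashlib
import json
import platform
from importlib import metadata


def build_provenance(dataset_name, output_bytes, pipeline_name,
                     pipeline_version, packages):
    """Return a JSON-serializable record describing how a dataset was made.

    All field names are illustrative assumptions, not a DCC schema.
    """
    versions = {}
    for name in packages:
        try:
            # Look up the installed version of each declared dependency.
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "dataset": dataset_name,
        # A checksum ties the record to one specific output file.
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
        "pipeline": {"name": pipeline_name, "version": pipeline_version},
        "python": platform.python_version(),
        "packages": versions,
    }


if __name__ == "__main__":
    record = build_provenance(
        dataset_name="example-peaks",      # hypothetical dataset label
        output_bytes=b"chr1\t100\t200\n",  # stand-in for real file contents
        pipeline_name="demo-pipeline",     # hypothetical pipeline name
        pipeline_version="1.2.0",          # e.g. a tag in the public repository
        packages=["pip"],
    )
    print(json.dumps(record, indent=2))
```

A record like this could be written next to each output file and committed or deposited with it, so that the version information travels with the dataset.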

 

Dissemination of Intermediate Data Analysis Results

All analysis results and data analysis products generated by the ENCODE consortium that will be of broad use to the community must be registered at the DCC under unique accession numbers as soon as they are stable, and certainly no later than the time of manuscript acceptance.

What to release. Examples of types of analysis results to be released include:

     1. Elements in the ENCODE Encyclopedia

     2. Other major maps of gene regulatory elements, key genomic features or predictions, and gene regulatory network models.

     3. Other analyses that are significant elements of planned publications.

When possible, analysis results should be released using standard ENCODE file formats. Analyses comprising major ENCODE deliverables (e.g. the Encyclopedia) may be subject to pre-release vetting by consortium members.

Analyses should be released in an unrestricted manner via the DCC when they are free of personally identifiable information (to current standards). When IRB rules apply, analysis products should be released via controlled access (such as dbGaP) using the appropriate sharing mechanism.

Analyses should be accompanied by written documentation, ideally in the form of a publication or a pre-print (e.g. bioRxiv), with a clearly specified contact author. The release should specify and provide links to: (1) the datasets used, (2) the software used and the specific version, and (3) the specific pipelines used and their specific versions.

Good Software Design Practices: We recommend (1) using strict versioning and version control; (2) providing easy installation and compilation steps; (3) including simple test input datasets with matched output, and small unit tests for software with multiple components; (4) including realistic “production” tests, possibly matching publication figures; (5) specifying any unusual hardware requirements (e.g. CPU, RAM, disk).
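Recommendation (3) above, a simple test input with matched output, can be sketched as follows. This is a toy example under stated assumptions: count_intervals_per_chrom is a hypothetical stand-in for a real analysis step, and the input is a made-up BED-like fragment rather than ENCODE data.

```python
from collections import Counter


def count_intervals_per_chrom(bed_lines):
    """Count intervals per chromosome in tab-separated BED-like lines."""
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        chrom = line.split("\t")[0]
        counts[chrom] += 1
    return dict(counts)


# Small test input shipped with the software ...
TEST_INPUT = [
    "# toy BED-like input",
    "chr1\t100\t200",
    "chr1\t300\t400",
    "chr2\t50\t80",
]
# ... together with the output it is expected to produce.
EXPECTED_OUTPUT = {"chr1": 2, "chr2": 1}


def test_matched_output():
    assert count_intervals_per_chrom(TEST_INPUT) == EXPECTED_OUTPUT


if __name__ == "__main__":
    test_matched_output()
    print("matched-output test passed")
```

Shipping the input and expected output together lets a new user confirm that an installation works before running it on real data.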

 

Updated as of January 9, 2019