microRNA-seq Data Standards and Processing Pipeline

Assay Overview

 

MicroRNA-seq allows researchers to characterize and quantify the expression and prevalence of the small non-coding RNA moleccules known as microRNA. These molecules may play an important role in diseases, and significant effort is underway to understand their effects across a variety of tissue types and cells. For effective processing, the average insert size must be no more than 30 bases.

Updated May 2017

Pipeline Overview

The ENCODE miRNA-seq pipeline can be used for libraries generated from miRNAs, size-selected from total RNA to be 30 bp or smaller. . The microRNA-seq pipeline was developed by Ali Mortazavi's group at UC Irvine.

Pipeline Schematic

View the current instance of this pipeline

Inputs:

File format

Information contained in file

File description

Notes

fastq

reads

microRNA-seq reads from single-end libraries. Reads must meet the criteria outlined in the Uniform Processing Pipeline Restrictions
 gtf  genome annotation Default genome annotation file is the microRNA subset from GENCODE.  Human experiments use GENCODE V24, while mouse use GENCODE M4.

 

Outputs:

 

File format

Information contained in file

File description

Notes

bam

alignments

Produced by mapping reads to the genome. Reads are trimmed using a proprietary version of cutAdapt, linked below under References.
bigWig signal Normalized RNA-seq signal Signals are generated for both the plus and minus strands and for unique reads and unique+multimapping reads.
tsv gene (microRNA) quantifications Non-normalized counts.  
Quality control metrics are also generated, including gene quantification level and read depth.

 

The mapping of the reads is done using the STAR aligner. STAR is also used to obtain counts of miRNAs (number of reads mapped to each miRNA gene in the annotations file). The miRNA counts can be normalized, for example by library size, to obtain counts-per-million for downstream analysis.

References

Adapter Trimming

An important step before aligning miRNA-seq reads is the trimming of adapters. The raw sequencing reads for samples generated for this pipeline contain 3’ and 5’ adapters. The links below contain details of the pipeline including information about the adapter sequences used and an example script to trim these adapters:

  • View the adapter sequences and an example Cutadapt script (version 1.7.1 used to generate data by this pipeline) to trim the adapter sequences.

  • View the index generation step (using comprehensive GENCODE annotations) and the alignment step of trimmed reads Using STAR (using miRNA subset of GENCODE annotations).

Genomic References

View the mapping assembly and genome annotation reference files used in this pipeline

STAR Indices

This pipeline requires both assembly information for the species of interest and a gene reference. STAR creates an index for use in the mapping step. Please note that the comprehensive GENCODE references are used for the index generation step, while the miRNA subsets of the corresponding annotations are used at the alignment step to obtain miRNA counts by STAR.

The miRNA subset of GENCODE v24 for human (used at the mapping step by STAR)

The miRNA subset of GENCODE M4 for mouse (used at the mapping step by STAR)

Links and Publications

Find data generated by the pipeline here

Uniform Processing Pipeline Restrictions

  • The mapped read length should be a minimum of 16 base pairs. 
  • Sequencing should be single-ended. 
  • All Illumina platforms are supported for use in the uniform pipeline; colorspace (SOLiD) are not supported.
  • Library insert size range must be <30 and must be indicated in the metadata. 
  • Alignment files are mapped to either the GRCh38 or mm10 sequences.
  • Gene and transcript quantification files are annotated to either GENCODE V24 or M4.

Current Standards

  • Experiments should have two or more replicates. Assays performed using EN-TEx samples may be exempted due to limited availability of experimental material. 
  • Each replicate should have a minimum of 2 million aligned reads.
  • The gene level quantification should have a Pearson correlation of ≥0.85.
  • The experiment must pass routine metadata audits in order to be released.