ENCODE Software
All software used or developed by the ENCODE Consortium
- pyranges: GenomicRanges for Python.
- pandas: A fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
- Zerone: Zerone discretizes several ChIP-seq replicates simultaneously and resolves conflicts between them. Publication available at doi: 10.1093/bioinformatics/btw336
- GEM-Tools: GEM-Tools is a C API and a Python module to support and simplify usage of the GEM Mapper.
- Fastx Toolkit: The FASTX-Toolkit is a collection of command-line tools for preprocessing short-read FASTA/FASTQ files.
- bioraddbg ATAC-seq MACS2: This Docker container provides an easy-to-use Docker interface to MACS2 for peak calling, with settings tailored for Bio-Rad Single Cell ATAC-seq chemistry.
- bioraddbg ATAC-seq filter beads: This Docker container provides an easy-to-use Docker interface to a bead filtration tool, with settings tailored for Bio-Rad Single Cell ATAC-seq chemistry. The container takes BAM files as input and performs "knee calling" to compute a bead barcode whitelist and a Jaccard index threshold for bead-to-droplet merging.
- bioraddbg ATAC-seq BWA: This Docker container provides an easy-to-use Docker interface to the BWA alignment tool, with settings tailored for Bio-Rad ATAC-seq chemistry.
- bioraddbg ATAC-seq deconvolute: This Docker container provides an easy-to-use Docker interface to the BAP tool, with settings tailored for Bio-Rad ATAC-seq chemistry.
- guppy_basecaller: Ont-Guppy is basecalling software available to Oxford Nanopore customers. For more information, please see https://nanoporetech.com/
- polyAsite_workflow: Pipeline to infer poly(A) site clusters by processing 3' end sequencing libraries prepared according to various protocols.
- gencode_utr_fix: This package fixes UTR features in the third column of GENCODE GTF files by converting UTR annotations into five_prime_utr and three_prime_utr, similar to Ensembl.
- interpretation_samples: Interpretation code for Segway samples that produces classifier output and diagnostic plots from apply_samples.py for test samples. Software type: genome segmentation
- split-pipe: The Parse Biosciences computational pipeline is an out-of-the-box software tool that you can run locally to convert fastq files straight to processed data (including gene-cell count matrices). Customers purchasing the Whole Transcriptome Kit will receive access to the Parse computational pipeline.
- PRINSEQ Lite: PRINSEQ preprocesses genomic or metagenomic sequence data in FASTA or FASTQ format.
- liftOver: This UCSC tool converts genome coordinates and genome annotation files between assemblies.
- fastq-tools: A collection of small and efficient programs for performing some common and uncommon tasks with FASTQ files. Software type: other
- Cell Ranger: Cell Ranger is a set of analysis pipelines that process Chromium single-cell RNA-seq output to align reads, generate feature-barcode matrices and perform clustering and gene expression analysis (mkfastq, count, aggr, and reanalyze).
- pbsv: pbsv is a suite of tools to call and analyze structural variants in diploid genomes from PacBio single molecule real-time sequencing (SMRT) reads. The tools power the Structural Variant Calling analysis workflow in PacBio's SMRT Link GUI. pbsv calls insertions, deletions, inversions, duplications, and translocations. Both single-sample calling and joint (multi-sample) calling are provided.
- freebayes: freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.
- Pysam: Python module wrapping the htslib C API and samtools for accessing SAM-formatted alignment files. Software type: other
- MATS: MATS is a computational tool to detect differential alternative splicing events from RNA-Seq data. The statistical model of MATS calculates the P-value and false discovery rate that the difference in the isoform ratio of a gene between two conditions exceeds a given user-defined threshold. From the RNA-Seq data, MATS can automatically detect and analyze alternative splicing events corresponding to all major types of alternative splicing patterns. MATS handles replicate RNA-Seq data from both paired and unpaired study designs.
- Bowtie 2: Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
- bigWigToWig: This utility converts the binary bigWig format to the text-based wig or bedGraph formats. Software type: file format conversion
- PyLiftover: PyLiftover is a library for quick and easy conversion of genomic (point) coordinates between different assemblies. It uses the same logic and coordinate conversion mappings as the UCSC liftOver tool. Software type: other
- trim-adapters-illumina: This program trims adapters from paired-end sequencing tags produced using the Illumina(c) platform. Software type: filtering
- edwBamFilter: Remove reads from a BAM file based on a number of criteria. Software type: filtering
- edwBamStats: Collect some basic characterization statistics of a BAM file. Software type: quality metric
- GATK: The Genome Analysis Toolkit or GATK is a software package for analysis of high-throughput sequencing data, developed by the Data Science and Data Engineering group at the Broad Institute. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Software type: variant annotation
- SeparateReadpairs: Splits up the interleaved file into two valid paired fastqs. Software type: other
- NextGenMap: NextGenMap is a flexible and fast read mapping program that is more than twice as fast as BWA while achieving a mapping sensitivity similar to Stampy. Software type: aligner
- gnuplot: Gnuplot is a portable command-line driven graphing utility for Linux, OS/2, MS Windows, OSX, VMS, and many other platforms. The source code is copyrighted but freely distributed (i.e., you don't have to pay for it). It was originally created to allow scientists and students to visualize mathematical functions and data interactively, but has grown to support many non-interactive uses such as web scripting. It is also used as a plotting engine by third-party applications like Octave. Gnuplot has been supported and under active development since 1986. Software type: other
- Preseq: From the Smith lab: "The preseq package is aimed at predicting and estimating the complexity of a genomic sequencing library, equivalent to predicting and estimating the number of redundant reads from a given sequencing depth and how many will be expected from additional sequencing using an initial sequencing experiment." Software type: other
- CpG methylation correlation: Calculates the Spearman correlation of two replicate bedMethyl files of CpG methylation. Software type: quality metric
- Trim Galore: Trim Galore! is a wrapper script to automate quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequence files (for directional, non-directional (or paired-end) sequencing). Software type: filtering
- permseq: An R package that performs multi-read mapping of ChIP-seq datasets. permseq works with bowtie and takes fastq files as input. It can work with ChIP-seq data alone or, when complementary data such as DNase-seq or histone ChIP-seq are available, it can use these data sources as prior information for multi-read mapping. The output from permseq is a text file of aligned reads available in bed, tagAlign, or bam formats.
- mosaics: An R package for TF and histone ChIP-seq analysis. mosaics takes aligned read files as input. It provides diagnostic plots for evaluating how well the mosaics model fits and allows FDR control. The mosaics-hmm module provides boundary-adjusted broad peak calls. The output from mosaics is a set of peaks in a number of formats, including bed. mosaics also generates intermediate data files/objects such as genome-wide read counts at the bin level for specified bin sizes and wig files for visualization in a genome browser.
- atSNP: An R package for screening SNPs for their potential to enhance or disrupt transcription factor binding sites. atSNP accepts as input either SNP ids or the actual coordinates of the SNPs and the alternative alleles. It uses ENCODE motifs and JASPAR motifs to evaluate the regulatory potential of the SNPs; however, it also allows a user-specified set of transcription factor binding sites in the form of position-specific matrices. For each SNP, it outputs the significance of the match to each position-specific matrix with both the reference and the alternative allele, and also the significance of the change in these match scores. atSNP also provides easy visualization of the SNP impact on the binding site by composite logo plots.
- wigToBigWig: The bigWig format is for display of dense, continuous data that will be displayed in the Genome Browser as a graph. BigWig files are created initially from wiggle (wig) type files, using the program wigToBigWig. The resulting bigWig files are in an indexed binary format. The main advantage of the bigWig files is that only the portions of the files needed to display a particular region are transferred to UCSC, so for large data sets bigWig is considerably faster than regular wiggle files. Software type: file format conversion
- Median Absolute Deviation: Calculates the Median Absolute Deviation (MAD) and correlation of two gene quantifications from replicate RNA-seq experiments. A measure of reproducibility, inversely correlated with data quality. Software type: quality metric
- Tophat BAM Repair: tophat_bam_xsA_tag_fix.pl was written by x wei to allow the use of tophat 2.0.8 in the ENCODE pipelines. It reads a bam file generated from paired-end fastqs by tophat 2.0.8 and corrects the XS:A:+ or XS:A:- tags showing read strandedness. Software type: other
- Concat-fastq: Concat-fastqs is an applet available for DNAnexus to concatenate a set of fastqs that should be merged for analysis. Software type: other
- bedToBigBed: bedToBigBed takes a standard bed file, or a non-standard bed file with an associated .as file, and creates a compressed bigBed version. Big Binary Indexed (BBI) files and the visualization of next-generation sequencing experiment results are described by W.J. Kent, PMCID: PMC2922891. Software type: file format conversion
- Makepseudoreplicates: Generate pseudoreplicates for self-consistency tests. Software type: other
- WGBS output processor: Convert a Bismark CX_report file to bed-like files. Software type: other
- Samtools: Samtools is a suite of programs for interacting with high-throughput sequencing data. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling, and alignment viewing, and thus provides universal tools for processing read alignments (PMID: 19505943). Software type: other
- Picard: A set of tools (in Java) for working with next-generation high-throughput sequencing (HTS) data in the BAM format. Picard is implemented using the HTSJDK Java library, which supports access to common file formats, such as SAM and VCF, used for high-throughput sequencing data. There is currently no published paper for the Picard software. Software type: filtering, other
- Bismark: A tool to map bisulfite-converted sequence reads and determine cytosine methylation states. The output produced by Bismark discriminates between cytosines in CpG, CHG and CHH context and enables bench scientists to visualize and interpret their methylation data soon after the sequencing run is completed (PMID: 21493656). Software type: other
- Flux Capacitor: FluxCapacitor reconstructs abundances of known transcript forms from RNA-seq data (PMCID: PMC3836232). Software type: transcript identification
- BWA: BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome, based on a Burrows-Wheeler index. The publication for the short-read alignment component is PMID: 19451168, while PMID: 20080505 outlines the algorithm for aligning sequences >200 bp up to 1 Mb. Software type: aligner
- npIDR: Non-parametric Irreproducible Discovery Rate (npIDR) essentially takes a pooled sample of all replicates and computes (a) the frequency of seeing count=x; (b) the frequency of seeing count=x given that in *ALL* other replicates the count is equal to zero. The original Irreproducible Discovery Rate statistical test is published at DOI: 10.1214/11-AOAS466. Software type: quality metric
- FastQC: FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. Babraham Bioinformatics Web site: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Software type: quality metric
- bedGraphToBigWig: Convert bedGraph to bigWig file. Description of Big Binary Indexed (BBI) files and visualization of next-generation sequencing experiment results explained by W.J. Kent, PMCID: PMC2922891. Software type: file format conversion
- WASP: WASP is a software package for two related tasks: (1) correcting allelic bias in mapped sequencing reads and (2) identifying molecular quantitative trait loci (QTLs) using next-generation sequencing data (e.g. gene expression QTLs or histone mark QTLs). The WASP mapper works with any read mapping pipeline that outputs BAM or SAM format. WASP identifies molecular QTLs using a statistical test that combines information about the total depth and allelic imbalance of mapped reads. WASP can call QTLs with very small sample sizes (as few as 10) compared to traditional QTL mapping approaches. Software type: aligner, variant annotation
- Webgestalt: WebGestalt is a "WEB-based GEne SeT AnaLysis Toolkit". It is designed for functional genomic, proteomic and large-scale genetic studies from which a large number of gene lists (e.g., differentially expressed gene sets, co-expressed gene sets, etc.) are continuously generated. WebGestalt incorporates information from different public resources and provides an easy way for biologists to make sense out of gene lists.
- RuleFit3: RuleFit3 is a predictive learning method and interpretational tool. It is based on general regression and classification models, which are constructed as linear combinations of simple rules derived from the data. Each rule consists of a conjunction of a small number of simple statements concerning the values of individual input variables (doi:10.1214/07-AOAS148).
- Mfinder: mfinder is a software tool for network motifs detection. Network motifs are defined as basic interaction patterns that recur throughout biological networks, much more often than in random networks. In order to detect network motifs mfinder implements two methods: a full enumeration of subgraphs and a sampling of subgraphs for estimation of subgraph concentrations. mfinder generates random networks based on the switching method, the stubs method and "Go with the winners" algorithm.
- lumi package: The lumi package in R provides an integrated solution for the Illumina microarray data analysis. It includes functions of Illumina BeadStudio (GenomeStudio) data input, quality control, BeadArray-specific variance stabilization, normalization and gene annotation at the probe level. It also includes the functions of processing Illumina methylation microarrays, especially Illumina Infinium methylation microarrays.
- King: KING is a rapid algorithm for relationship inference using high-throughput genotype data typical of GWAS that allows the presence of an unknown population substructure. The relationship of any pair of individuals can be precisely inferred by robust estimation of their kinship coefficient, independent of sample composition or population structure (sample invariance). KING performs properly even under extreme population stratification, while algorithms assuming a homogeneous population give systematically biased results. KING performs relationship inference on millions of pairs of individuals in the order of minutes.
- Java Treeview: Java Treeview is an open source, cross-platform gene expression visualization tool and an interactive display of clustered gene expression data, similar to Eisen's treeview. It is also an extensible starting point for other gene expression visualization tools.
- HiveR: The hive plot is a visualization method for drawing networks. Nodes are mapped to and positioned on radially distributed linear axes. Edges are drawn as curved links. Hive plots can give a quantitative understanding of important aspects of a network's structure. Hive plots can also manage the visual complexity arising from a large number of edges and expose both trends and outlier patterns in a network structure.
- GSC (Genome Structure Correction): Assessing the significance of observations within large-scale genomic studies using randomly subsampled genomic regions is a difficult problem because there often exists a complex dependency structure between observations. GSC is a data subsampling approach based on a block stationary model for genomic features to alleviate the hidden dependencies. This model is motivated by earlier studies of DNA sequences, which show that there are global shifts in base composition, but that certain sequence characteristics are locally unchanging.
- GREAT: GREAT assigns biological meaning to a set of non-coding genomic regions by analyzing the annotations of the nearby genes. Thus, it is particularly useful in studying cis functions of sets of non-coding genomic regions. Cis-regulatory regions can be identified via both experimental methods (e.g., ChIP-seq) and by computational methods (e.g. comparative genomics).
- GOrilla: GOrilla is a web-based application that identifies enriched GO terms in ranked lists of genes, without requiring the user to provide explicit target and background sets. It also employs a flexible threshold statistical approach to discover GO terms that are significantly enriched at the top of a ranked gene list. Building on a complete theoretical characterization of the underlying distribution, GOrilla computes an exact p-value for the observed enrichment, taking threshold multiple testing into account without the need for simulations. The output of the enrichment analysis is visualized as a hierarchical structure, providing a clear view of the relations between enriched GO terms.
- GFS: GFS is a program that maps peptide mass fingerprint data directly to raw genomic sequence, enabling rapid low-cost identification of proteins in genomes for which annotation is lacking. An experimentally obtained peptide mass fingerprint is entered into the program, which then scans a genome sequence of interest and outputs the most likely regions of the genome from which the mass fingerprint is derived.
- GERP: GERP identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. These deficits, or rejected substitutions, are a natural measure of constraint that reflects the strength of past purifying selection on the element. GERP estimates constraint for each alignment column; elements are identified as excess aggregations of constrained columns. A false-positive rate (which is user-settable) is calculated using 'shuffled' alignments in which the order of columns is randomized.
- F-seq: F-seq is a software package that generates a continuous density estimation of sequence tags mapped to a reference genome, which can be displayed using the UCSC Genome Browser. The continuous density plots are more intuitive than discrete histogram-like plots used by some applications. Using kernel density estimation, F-seq can aid the identification of biologically meaningful sites. Software type: peak caller
- FANMOD: FANMOD is a tool for fast network motif detection. It relies on recently developed algorithms to improve the efficiency of network motif detection by orders of magnitude. This facilitates the detection of larger motifs in bigger networks than previously possible. Additional benefits of FANMOD are the ability to analyze colored networks, a graphical user interface and the ability to export results to a variety of machine-readable and human-readable file formats, including comma-separated values and HTML.
- DAVID: DAVID is able to extract biological features and meanings associated with large gene lists. DAVID is able to handle any type of gene list, no matter which genomic platform or software package generated them. DAVID systematically maps a large number of interesting genes in a list to the associated biological annotation (e.g., gene ontology terms), and then statistically highlights the most overrepresented (enriched) biological annotation out of thousands of linked terms and contents.
- Cluster 3.0: Cluster 3.0 is an implementation of k-means clustering, hierarchical clustering and self-organizing maps in a single multi-purpose open-source library of C routines, callable from other C and C++ programs. This library is an improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix. Additionally a Python and a Perl interface to the C Clustering Library is implemented to combine the flexibility of a scripting language with the speed of C.
- Circos: Circos is a software package for visualizing data and information. It visualizes data in a circular layout for exploring relationships between objects or positions. Circos creates publication-quality infographics and illustrations with a high data-to-ink ratio, layered data and symmetries.
- Bowtie: Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).
- BEDTools: Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetics: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, and VCF. Software type: file format conversion
- ANNOVAR: ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human genome hg18, hg19, as well as mouse, worm, fly, yeast and many others). Given a list of variants with chromosome, start position, end position, reference nucleotide and observed nucleotides, ANNOVAR can perform: (i) Gene-based annotation: identify whether SNPs or CNVs cause protein coding changes and the amino acids that are affected. (ii) Region-based annotations: identify variants in specific genomic regions, for example, conserved regions among 44 species, predicted transcription factor binding sites, segmental duplication regions, GWAS hits, database of genomic variants, DNAse I hypersensitivity sites, ENCODE H3K4Me1/H3K4Me3/H3K27Ac/CTCF sites, ChIP-Seq peaks, RNA-Seq peaks, or many other annotations on genomic intervals. (iii) Filter-based annotation: identify variants that are reported in dbSNP, identify the subset of common SNPs (MAF>1%) in the 1000 Genome Project, identify subset of non-synonymous SNPs with SIFT score>0.05, find intergenic variants with GERP++ score>2, or many other annotations on specific mutations.
- Regulatory Elements Database: Using an intuitive interface, you can 1) identify DNaseI-hypersensitive sites (DHS) within a genomic region of interest, 2) predict the target gene for DHS of interest, 3) predict the DHS that regulate a gene of interest, 4) identify clusters of similarly regulated DHS, that may have related function, 5) identify enriched motifs for transcription factors that may bind in these similarly regulated DHS, and 6) identify DHS that contain a DNA sequence motif for a transcription factor of interest. The Regulatory Elements Database provides access to roughly 2.8 million DNaseI-hypersensitive sites and their signal in 112 human samples, as well as Affymetrix microarray expression data for the same cell-types. Software type: database
- ENCODE-motifs: A database that uncovers the molecular basis of TF binding in the human genome based on regulatory motif analysis of all Transcription Factors (TFs) grouped by family. This allows browsing of all known motifs for each factor, curated from TRANSFAC, Jaspar, and Protein Binding Microarray (PBM) experiments, and their enrichment and instances within corresponding TF binding experiments. It also provides a list of novel regulatory motifs discovered by systematic application of several motif discovery tools (including MEME, MDscan, Weeder, AlignACE) and evaluated based on their enrichment relative to control motifs within TF-bound regions. ENCODE-motifs also provides a genome-wide map of regulatory motif instances in the human genome for both known and novel motifs. Software type: database
- Factorbook: Factorbook is a transcription factor (TF)-centric web-based repository of integrative analysis associated with ENCODE ChIP-seq data. It includes de novo discovered motifs, chromatin features surrounding ChIP-seq peaks (histone modification patterns, DNase I cleavage footprints, and nucleosome positioning profiles), deep-learned models of sequence features driving TF binding, and integration with GWAS variants and the ENCODE Registry of candidate cis-regulatory elements. Software type: database
- PIQ: Protein Interaction Quantification: PIQ is a computational method that models the magnitude and shape of genome-wide DNase profiles to facilitate the identification of transcription factor (TF) binding sites. The input of PIQ is one or more DNase-seq experiments, the genome sequence of the organism assayed and a list of motifs represented as position weight matrices (PWMs) that describe candidate TF binding sites. PIQ uses machine learning methods to normalize input DNase-seq data and then predicts TF binding by detecting both the shape and magnitude of DNase profiles specific to each TF. The output of PIQ is the probability of occupancy for each candidate binding site in the genome, along with aggregate TF-specific scores (e.g. metrics for TF-specific chromatin opening). Software type: database
- RegulomeDB: Identifies DNA features and regulatory elements in non-coding regions of the human genome. One can enter dbSNP IDs, BED files, VCF files, or GFF3 files. A score is returned assessing the evidence for regulatory potential. Clicking on the score reveals the data supporting the inference, by data type and cell type. One can also click on hyperlinks to see the SNP or the region in the UCSC browser, ENSEMBL browser, and dbSNP. Software type: database, variant annotation
- HaploReg: Explores annotations of the noncoding genome at variants on haplotype blocks, such as candidate regulatory SNPs at disease-associated loci. Under Set Options tab, set Browse ENCODE button to "on" and select an LD threshold and reference population. Under Build Query Tab, enter a SNP (rsXXXXX), a set of SNPs, a genomic region, or select a GWAS from the drop down menu. HaploReg returns SNPs in LD with query SNPs, their frequency in 4 populations from 1000 Genomes Phase1, and also tells you what evidence ENCODE has found for regulatory protein binding (mouse over to see the protein names), chromatin structure (mouse over to see the cell types with DNase hypersensitivity), the chromatin state of the region (the chromatin state can predict an enhancer or promoter), and putative transcription factor binding motifs that are altered by the variant. Clicking on the SNP name hyperlink reveals further details, including cell type metadata and the mechanism of disruption/creation of TF binding regulatory motifs (showing the PWM matched and its alignment to the local sequence context). SNPs are also intersected with cross-species conserved elements, chromatin states from the Roadmap Epigenomics Consortium, and lead eQTLs from the GTEx Project browser. Software type: database, variant annotation
- Genomedata: Efficiently stores multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data, while retaining a small disk space footprint. Utilities have also been developed to load data into this format. A reference implementation in Python and C components is available under the GNU General Public License.
- BEDOPS: Performs common genomic analysis tasks and offers improved flexibility, scalability and execution time characteristics over previously published packages. The suite includes a utility to compress large inputs into a lossless format that can provide greater space savings and faster data extractions than alternatives. Software type: file format conversion
- SPOT (Signal Portion of Tags): Measures signal-to-noise in genome-wide epigenetic profiling assays by calculating the fraction of reads that fall in tag-enriched regions (see the Hotspot program) from a sample of 5 million reads. The SPOT methodology can be generalized to use any peak-finder. A publication of SPOT and a more complete description are in preparation. SPOT is simply the percentage of all tags that fall in hotspots, and the publication for the Hotspot quality metric is found at PMID: 21258342. Software type: quality metric
- Phantompeakqualtools: Used to generate these quality metrics: NSC and RSC. The NSC (Normalized strand cross-correlation) and RSC (relative strand cross-correlation) metrics use cross-correlation of stranded read density profiles to measure enrichment independently of peak calling. Software type: quality metric, filtering
- CAGT (Clustering AGgregation Tool): Deciphers the heterogeneity and diversity of profiles of functional signals (e.g., chromatin mark ChIP-seq signal) centered at a collection of sites (e.g., TSSs or TF binding sites) in a genome. Rather than averaging the profiles over all the anchor sites (traditional aggregation plots), CAGT accounts for the inherent heterogeneity in signal magnitude, shape and implicit strand orientation of chromatin marks. CAGT partitions the set of anchor sites into compact clusters such that each cluster represents anchor points that show similar patterns of the functional signal profiles with different clusters having distinct patterns. The different groups of patterns are often enriched for distinct biological functions (PMID: 22955985). Software type: genome segmentation
- Segway — source Uses a machine learning method to analyze multiple tracks of functional genomics data, searching for recurring patterns. The software automatically partitions the genome into non-overlapping segments and assigns each segment a label. The resulting annotation provides a human-interpretable summary of the functional landscape of the genome, yielding hypotheses about novel instances or classes of functional elements. Software type: genome segmentation
- Segtools — source A Python package that analyzes genomic segmentations. The software efficiently calculates a variety of summary statistics and produces corresponding publication-quality visualizations. The overall goal of Segtools is to provide a bird's-eye view of complex genomic data sets, allowing researchers to easily generate and confirm hypotheses. Software type: genome segmentation
- Wiggler — source Produces normalized genome-wide signal coverage tracks from raw read alignment files. Supports pooling of replicate datasets with replicate-specific and data-type-specific read shifting and smoothing parameters. It can be used to generate signal density maps for ChIP-seq, DNase-seq, FAIRE-seq and MNase-seq data. Wiggler also implicitly models variability in mappability to appropriately normalize signal density and distinguish missing data from true zero signal.
- Irreproducible Discovery Rate (IDR) — source Measures consistency between replicates in high-throughput experiments. Also uses reproducibility in score rankings between peaks in each replicate to determine an optimal cutoff for significance. The core IDR R package can be downloaded from the IDR download page: http://cran.r-project.org/web/packages/idr/index.html Software type: quality metric, filtering
- Hotspot — source Identifies regions of local enrichment, including peaks, in genomic short-read sequence data. Uses the binomial distribution with a local background model to automatically correct for broad-scale regional differences in tag levels. It is applicable to a wide variety of epigenetic profiling assays, including ChIP-seq and DNase-seq. Hotspot forms the basis of the SPOT data quality metric. Software type: peak caller
- Scripture — source Reconstructs transcriptomes, relying solely on RNA-seq reads and an assembled genome to build a transcriptome ab initio. The statistical methods to estimate read coverage significance are also applicable to other sequencing data. Scripture also has modules for ChIP-seq peak calling. Software type: transcriptome assembly
- Flux Capacitor — source A program to estimate the frequencies of annotated transcripts (GTF format) from an RNA-Seq experiment, solving a linear program inferred from the observed read mappings (BAM format). There are options for single, stranded, and/or paired-end reads.
- MACS — source A widely used, fast, robust ChIP-seq peak-finding algorithm that accounts for the offset in forward-strand and reverse-strand reads to improve resolution and uses a dynamic Poisson distribution to effectively capture local biases in the genome. MACS 1.4 was used in the ENCODE 2 uniform peak calling pipeline. Software type: peak caller
- GEM — source GEM is a Java software package for analyzing genome-wide ChIP-seq/ChIP-exo data. GEM can decompose single observed peaks into multiple binding events, determine binding event location at high spatial resolution, and discover explanatory DNA sequence motifs with an integrated model of ChIP reads and proximal DNA sequences. GEM is able to process single-end or paired-end data and can be run in single-condition mode or multi-condition mode. GEM will be used in the ENCODE 3 uniform peak calling pipeline. Software type: peak caller
- SPP — source A ChIP-seq peak calling algorithm, implemented as an R package, that accounts for the offset in forward-strand and reverse-strand reads to improve resolution and compares enrichment in signal to background or control experiments. It can also estimate whether the available number of reads is sufficient to achieve saturation, meaning that additional reads would not allow identification of additional peaks. SPP will be used in the ENCODE 3 uniform peak calling pipeline. Software type: peak caller