The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely-used file formats: BED, GFF/GTF, VCF, and SAM/BAM. Using BEDTools, one can develop sophisticated pipelines that answer complicated research questions by "streaming" several BEDTools together. The following are examples of common questions that one can address with BEDTools.
Because all of the BEDTools accept input from standard input (stdin), one can pipe several commands together to facilitate more complicated analyses. The tools also allow fine control over how output is reported. Most recently, I have added support for sequence alignments in BAM (http://samtools.sourceforge.net/) format, as well as for features in VCF, GFF, and blocked BED format. The tools are quite fast and typically finish in a matter of a few seconds, even for large datasets.
As stated, much of the power in BEDTools comes from the ability to pipe multiple BEDTools together with UNIX commands. The following example will hopefully illustrate this strength.
Example: Imagine you have a BED file of SNP calls that were generated from some fancy new variant detection method. You are now doing an initial screen of the results. The SNP calls are genome-wide and of varied support and biological interest. The BED file of SNP calls might look like this, where the name field is the observed alleles and the score is the depth:
$ head snps.bed
chr1 100 101 A/G 100
chr1 200 201 C/G 1000
...
chrX 300 301 C/T 500
Let's say you want to quickly find all transitions that are in exons. Using BEDTools and egrep, the command would be:
$ egrep "A/G|C/T" snps.bed | intersectBed -a stdin -b exons.bed > snpsInExons.bed
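To see what the egrep stage hands to intersectBed, the filter can be tried on a few records first. The sketch below is illustrative only: the records in snps.bed and exons.bed are made up, the bedtools_demo directory is an assumption of this example, and grep -E is used as the equivalent of egrep. The overlap step runs only if intersectBed is found on the PATH.

```shell
# Toy data: all coordinates and values below are made up for illustration.
mkdir -p bedtools_demo
printf 'chr1\t100\t101\tA/G\t100\n'  >  bedtools_demo/snps.bed
printf 'chr1\t200\t201\tC/G\t1000\n' >> bedtools_demo/snps.bed
printf 'chrX\t300\t301\tC/T\t500\n'  >> bedtools_demo/snps.bed
printf 'chr1\t50\t150\texon1\n'      >  bedtools_demo/exons.bed

# Stage 1: keep only records whose name field contains A/G or C/T.
grep -E "A/G|C/T" bedtools_demo/snps.bed > bedtools_demo/transitions.bed
cat bedtools_demo/transitions.bed

# Stage 2: compute overlaps with exons, but only if BEDTools is installed.
if command -v intersectBed >/dev/null 2>&1; then
    intersectBed -a bedtools_demo/transitions.bed -b bedtools_demo/exons.bed
fi
```

With these toy records, only the A/G and C/T lines survive the first stage; the C/G record is dropped before any overlap is computed.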
Great, but now you want to get to the interesting bits for your big paper, so you want to screen for novel variants by excluding SNP calls that are already in dbSNP. In this case, the "-v" option reports only those SNPs passed to intersectBed that are NOT in dbSNP.
$ egrep "A/G|C/T" snps.bed | intersectBed -a stdin -b exons.bed | intersectBed -v -a stdin -b dbSnp130.bed > novelSnpsInExons.bed
But now you detect an artifact where false positives are enriched in SNPs having coverage > 100. You refine your original query accordingly.
$ awk '$5 <= 100' snps.bed | egrep "A/G|C/T" | intersectBed -a stdin -b exons.bed | intersectBed -v -a stdin -b dbSnp130.bed > bonafideNovelSnpsInExons.bed
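The depth filter is a plain numeric test on the fifth column, so it can be checked without BEDTools. The following sketch uses made-up records and a hypothetical bedtools_demo directory. It also shows that the choice between `<` and `<=` decides whether a SNP with a depth of exactly 100 is kept; since the artifact is described as affecting coverage > 100, the sketch treats `<=` as the intended threshold.

```shell
# Toy data: made-up SNP calls; column 5 is the depth.
mkdir -p bedtools_demo
printf 'chr1\t100\t101\tA/G\t100\n'  >  bedtools_demo/depth_snps.bed
printf 'chr1\t200\t201\tC/G\t1000\n' >> bedtools_demo/depth_snps.bed
printf 'chrX\t300\t301\tC/T\t500\n'  >> bedtools_demo/depth_snps.bed
printf 'chr2\t400\t401\tC/T\t40\n'   >> bedtools_demo/depth_snps.bed

# Keep records with depth of at most 100, then keep A/G and C/T calls.
awk '$5 <= 100' bedtools_demo/depth_snps.bed \
    | grep -E "A/G|C/T" > bedtools_demo/filtered.bed

# Same pipeline with a strict comparison, which drops the depth-100 record.
awk '$5 < 100' bedtools_demo/depth_snps.bed \
    | grep -E "A/G|C/T" > bedtools_demo/filtered_strict.bed

cat bedtools_demo/filtered.bed
```

Running the awk stage first keeps the later, more expensive overlap steps working on fewer records.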
(BAM) denotes tools that support BAM alignment files.
|intersectBed (BAM)||Returns overlaps between two BED/GFF/VCF files.|
|pairToBed (BAM)||Returns overlaps between a paired-end BED file and a regular BED/VCF/GFF file.|
|bamToBed (BAM)||Converts BAM alignments to BED6, BED12, or BEDPE format.|
|bedToBam (BAM)||Converts BED/GFF/VCF features to BAM format.|
|bed12ToBed6||Converts "blocked" BED12 features to discrete BED6 features.|
|bedToIgv||Creates IGV batch scripts for taking multiple snapshots from BED/GFF/VCF features.|
|coverageBed (BAM)||Summarizes the depth and breadth of coverage of features in one BED versus features (e.g., "windows", exons, etc.) defined in another BED/GFF/VCF file.|
|genomeCoverageBed (BAM)||Creates either a histogram, BEDGRAPH, or a "per base" report of genome coverage.|
|unionBedGraphs||Combines multiple BedGraph files into a single file, allowing coverage/other comparisons between them.|
|annotateBed||Annotates one BED/VCF/GFF file with overlaps from many others.|
|groupBy||Summarizes data in a file/stream based on common columns.|
|overlap||Returns the number of base pairs of overlap between two features on the same line.|
|pairToPair||Returns overlaps between two paired-end BED files.|
|closestBed||Returns the closest feature to each entry in a BED/GFF/VCF file.|
|subtractBed||Removes the portion of an interval that is overlapped by another feature.|
|windowBed (BAM)||Returns overlaps between two BED/VCF/GFF files based on a user-defined window.|
|mergeBed||Merges overlapping features into a single feature.|
|complementBed||Returns all intervals not spanned by the features in a BED/GFF/VCF file.|
|fastaFromBed||Creates FASTA sequences based on intervals in a BED/GFF/VCF file.|
|maskFastaFromBed||Masks a FASTA file based on BED coordinates.|
|shuffleBed||Randomly permutes the locations of the features in a BED file within a genome.|
|slopBed||Adjusts each BED entry by a requested number of base pairs.|
|sortBed||Sorts a BED file by chrom, then start position. Other sort orders are also supported.|
|linksBed||Creates an HTML file of links to the UCSC or a custom browser.|
$ cat reads.bed | intersectBed -a stdin -b genes.bed > readsToGenes.bed
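The same stdin convention lets the tools in the table be chained with each other. As a hedged sketch, the example below sorts a few made-up intervals and, if the tools are installed, passes them through sortBed and mergeBed. The file names and the bedtools_demo directory are assumptions of this example, and the plain sort command is included only so the ordering can be inspected without BEDTools.

```shell
# Toy data: made-up, unsorted intervals; the first two chr1 intervals overlap.
mkdir -p bedtools_demo
printf 'chr1\t300\t400\n' >  bedtools_demo/intervals.bed
printf 'chr1\t100\t200\n' >> bedtools_demo/intervals.bed
printf 'chr1\t150\t250\n' >> bedtools_demo/intervals.bed
printf 'chr2\t50\t60\n'   >> bedtools_demo/intervals.bed

# Sort by chromosome, then numerically by start, using the standard sort utility.
LC_ALL=C sort -k1,1 -k2,2n bedtools_demo/intervals.bed > bedtools_demo/sorted.bed
cat bedtools_demo/sorted.bed

# If BEDTools is installed, do the same with sortBed and stream into mergeBed,
# which should combine the overlapping chr1 intervals into one feature.
if command -v sortBed >/dev/null 2>&1 && command -v mergeBed >/dev/null 2>&1; then
    sortBed -i bedtools_demo/intervals.bed | mergeBed -i stdin
fi
```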
Quinlan, AR and Hall, IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841-842.
BEDTools was developed and is maintained by Aaron Quinlan, a postdoctoral fellow in Ira Hall's laboratory at The University of Virginia. Questions should be posted to the BEDTools discussion list. Alternatively, contact Aaron via email (firstlast at gmail.com).