BLAST Frequently Asked Questions (FAQ)

What is BLAST?
What are Gapped BLAST and PSI-BLAST?
Which BLAST program should I use?
Why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?
What is low-complexity sequence?
What is the Expect (E) value?
How do I perform a similarity search with a short peptide/nucleotide sequence?
How can I see low-similarity matches when there are many strong hits to my query sequence?

Q: What is BLAST?

BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits. BLAST uses a heuristic algorithm which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity (Altschul et al., 1990).

Q: What are Gapped BLAST and PSI-BLAST?

Gapped BLAST and PSI-BLAST are useful search tools provided by the BLAST server (version 2.0) (Altschul et al., 1997).

The Gapped BLAST algorithm allows gaps (deletions and insertions) to be introduced into the alignments that are returned. Allowing gaps means that similar regions are not broken into several segments. The scoring of these gapped alignments tends to reflect biological relationships more closely.

Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases.

Q: Which BLAST program should I use?

You have many choices to make between different BLAST programs and how to access them. Please see the Overview for more information on this topic. The easiest way to search is to use the BLAST Web pages. Simply paste your sequence into the box, choose a database, choose a program that matches your sequence to the database, and press Search. There are many additional parameters that can be controlled, but for a basic search, the default options work well.

Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?

You are seeing the result of automatic filtering of your query for low-complexity sequence, which is performed to prevent artifactual hits. The filter replaces any low-complexity sequence that it finds with the letter "N" in nucleotide sequences (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Low-complexity regions can result in high scores that reflect compositional bias rather than significant position-by-position alignment (Wootton & Federhen, 1996). Filter programs can eliminate these potentially confounding matches from the BLAST reports, leaving regions whose BLAST statistics reflect the specificity of their pairwise alignment. Queries searched with the blastn program are filtered with DUST. The other BLAST programs use SEG.

You can change the default and remove these filters if you like. On the Basic BLAST Web interface you will see a button to click that will remove the filter. On the Advanced Page you can set the filter to "none" in the menu. For email BLAST you can use the command: filter NONE.

Q: What is low-complexity sequence?

Regions with low-complexity sequence have an unusual composition, and this can create problems in sequence similarity searching (Wootton & Federhen, 1996). Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits (please see Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?).

In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
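
As a rough illustration of what "low complexity" means, the sketch below scores a sequence by the Shannon entropy of its letter composition and masks windows that fall below a cutoff. This is not the SEG or DUST algorithm; the function names, the window size, and the threshold are all assumptions chosen for illustration only.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy, in bits per letter, of the composition of seq."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def mask_low_complexity(seq, window=12, threshold=1.5, mask_char="X"):
    """Replace every window whose entropy is below threshold with mask_char.

    window and threshold are illustrative values, not BLAST defaults.
    """
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            masked[i:i + window] = mask_char * window
    return "".join(masked)

# The two example sequences from the text score well below the maximum
# possible entropy (about 4.3 bits for 20 amino acids, 2 bits for 4 bases).
print(shannon_entropy("PPCDPPPPPKDKKKKDDGPP"))
print(shannon_entropy("AAATAAAAAAAATAAAAAAT"))
print(mask_low_complexity("AAATAAAAAAAATAAAAAAT", mask_char="N"))
```

A sequence that uses its alphabet evenly scores near the maximum, while a repetitive one scores near zero, which is why a simple composition measure already separates the examples above from typical sequence.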

Q: What is the Expect (E) value?

The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences.

The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.

In BLAST 2.0, the Expect value is also used instead of the P value (probability) to report the significance of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
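
The exponential relationship between score and E value can be sketched with the Karlin-Altschul relationship E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length. The values of lambda and K used below are illustrative assumptions (roughly those associated with ungapped BLOSUM62 scoring); an actual BLAST report prints the values it used, and the function name and example lengths are hypothetical.

```python
import math

def expect_value(score, query_len, db_len, lam=0.318, k=0.134):
    """Estimate E = K * m * n * exp(-lambda * S).

    lam and k are illustrative defaults, not values taken from a real search.
    """
    return k * query_len * db_len * math.exp(-lam * score)

# A higher score gives an exponentially smaller E value,
# while a larger database gives a proportionally larger one.
for s in (30, 40, 50):
    print(s, expect_value(s, query_len=250, db_len=50_000_000))
```

Under this sketch, doubling the database size doubles E for the same score, which is why the same alignment becomes less significant as databases grow.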

Q: How do I perform a similarity search with a short peptide/nucleotide sequence?

First, you will probably need to increase the Expect (E) value in your search. A short query is more likely to occur by chance in the database. Therefore, even a perfect match can have low statistical significance and may not be reported. Increasing the E value allows you to look farther down in the hit list and see matches that would normally be discarded because of low statistical significance.

For most searches, an Expect value up to 1000 is enough to see results. However, you can raise the E value further on the Advanced BLAST Web page by typing -e 10000, for example, in the Other Advanced Options Box.

If you still do not get results after increasing the E value, you may want to try decreasing the Word size (W), another parameter that becomes important with a short query. The BLAST algorithm uses "words" to nucleate regions of similarity. The default Word size for a protein sequence is 3 residues and for nucleotide sequences it is 11 bp. A blastn search will not work with a Word size of less than 7. A good rule of thumb is that the query length must be at least twice the Word size. For example, if your query is a protein sequence of 4 residues, then the Word size should be reduced to 2. Please note that the smaller the Word size, the slower your search will be.

You can lower the default word size on the Advanced BLAST Web page. In the Other Advanced Options, type -W some_number (for example, -W 9).
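
The rule of thumb above (query length at least twice the Word size) can be sketched as follows. The helper names are hypothetical, and the code only illustrates the arithmetic and the idea of overlapping "words"; it is not how BLAST itself selects or scores words.

```python
def query_words(query, word_size):
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]

def suggested_word_size(query_len, default, minimum=1):
    """Largest word size up to default with query_len >= 2 * W.

    minimum is the smallest size the program accepts (7 for blastn,
    according to the text); the result is never allowed below it.
    """
    return max(minimum, min(default, query_len // 2))

# A 4-residue protein query with the default Word size of 3.
print(suggested_word_size(4, default=3))
print(query_words("PADP", 2))

# A 30 bp nucleotide query can keep the blastn default of 11.
print(suggested_word_size(30, default=11, minimum=7))
```

Note that for a nucleotide query shorter than 14 bp the rule cannot be satisfied at all, because blastn does not accept a Word size below 7.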

Sometimes a short query does not produce results because it contains low-complexity sequence. Often this type of sequence can be recognized by the human eye because it looks very redundant, for example the protein sequence PADPPPDPPPP or the nucleotide sequence AAATTTAAAAAT. A filter for low-complexity sequence is applied by default to BLAST nucleotide and protein searches. If your query has regions of low-complexity sequence, then large portions of your query may be filtered out, essentially making your query shorter than you might have expected. Removing the filter will help in these cases.

Finally, you can change the matrix to optimize for searching with short protein sequences. For information on query length and the matrix see the document

Q: How can I see low-similarity matches when there are many strong hits to my query sequence?

Often, when the query is a member of a large sequence family, the summary hit list and the alignments returned only contain very high scoring hits. To look at low-similarity matches, you must increase the maximum number of results returned.
On the BLAST Web pages, often it is sufficient to increase the size of the summary hit list and the number of alignments shown using the menus on the Advanced pages. However, it is possible to increase the lists even further using the Other Advanced Options box on the Advanced BLAST pages. For BLAST 2.0, "-v 2000", for example, will increase the number of descriptions returned in the summary hit list to 2000. The option "-b 2000" will similarly increase the number of alignments returned.
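
If you prepare many searches, the option string for the Other Advanced Options box can be assembled programmatically. The sketch below uses only the flags mentioned in this FAQ (-e, -W, -v, -b); the function name and its parameter names are hypothetical and are not part of BLAST.

```python
def advanced_options(expect=None, word_size=None, descriptions=None, alignments=None):
    """Build an option string from the flags described in this FAQ.

    Only options that are explicitly set are included in the result.
    """
    flags = [
        ("-e", expect),        # Expect (E) value threshold
        ("-W", word_size),     # Word size
        ("-v", descriptions),  # number of descriptions in the summary hit list
        ("-b", alignments),    # number of alignments returned
    ]
    return " ".join(f"{flag} {value}" for flag, value in flags if value is not None)

# Settings for a large sequence family, as in the example above.
print(advanced_options(descriptions=2000, alignments=2000))

# Settings for a short query, as described in the previous question.
print(advanced_options(expect=10000, word_size=9))
```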
