a generic tool for sequence alignment
An advanced guide to using exonerate
- This page gives some more advanced examples of using exonerate
to perform various types of pairwise comparison.
- It is assumed that you've already had a look at the examples
in the beginner's guide.
- Remember to use -h to get a short summary of available options,
or --help for a longer summary.
- Further information can be found on the
exonerate man page.
- If you have questions, or more examples to add to this tutorial,
please email me: firstname.lastname@example.org
- Sequence input options
- Using exonerate on a compute farm
- Picking output formats
- Applying score thresholds
- Increasing search sensitivity
- Increasing search speed
Sequence input options
- There are several ways to supply input to exonerate.
The simplest way is as shown below:
exonerate query.fasta target.fasta
- These inputs can contain single or multiple sequences.
- Exonerate will compare all queries with all targets.
- You can also specify using multiple queries or targets like this:
exonerate --query q1.fa q2.fa q3.fa --target t1.fa t2.fa t3.fa
- Furthermore, if the input is a directory, exonerate will recursively
read all fasta sequences contained within that directory.
- By default, all files with a .fa extension are assumed
to be fasta files, but you can change that behaviour using
--fastasuffix like this:
exonerate -q query_dir -t target_dir --fastasuffix .fasta
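To preview which files a directory input is likely to pick up, the suffix rule above can be approximated with find. This is only a sketch: the function name list_fasta_inputs is hypothetical, and it assumes the rule is a plain match on the end of the file name, which may differ in detail from what exonerate does internally.

```shell
# List files under a directory (recursively) whose names end in the given
# suffix. Defaults to ".fa", mirroring the default described above.
# Assumption: this is a plain file-name suffix match, not exonerate's own code.
list_fasta_inputs() {
    dir=$1
    suffix=${2:-.fa}
    find "$dir" -type f -name "*$suffix" | sort
}

# Example: list_fasta_inputs target_dir .fasta
```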
- By default, exonerate will try to guess which inputs
are DNA and which are peptide, by assuming that if the
first sequence contains >=85% ACGT, then the database is
nucleotide. This assumption can be wrong, and the test
may be slow if the first sequence is an entire chromosome,
so the sequence type may be specified explicitly using the
--querytype and --targettype options,
as in this example:
exonerate -q qy.fa -t tg.fa --querytype dna --targettype protein
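The >=85% ACGT rule described above can be sketched as a small shell function, which is handy for checking in advance how a file is likely to be classified. This is an illustration, not exonerate's own code: the name guess_seq_type is hypothetical, and the case-insensitive counting and the handling of empty input are assumptions.

```shell
# Guess "dna" or "protein" from the FIRST sequence of a FASTA file,
# using the >=85% ACGT rule. Assumptions: letters are counted
# case-insensitively and non-letter characters are ignored.
guess_seq_type() {
    awk '
        /^>/ { if (seen) exit; seen = 1; next }  # stop at the second header
        seen {
            seq = toupper($0)
            gsub(/[^A-Z]/, "", seq)              # keep letters only
            total += length(seq)
            acgt  += gsub(/[ACGT]/, "", seq)     # gsub returns the match count
        }
        END {
            if (total == 0)                 print "unknown"
            else if (acgt / total >= 0.85)  print "dna"
            else                            print "protein"
        }
    ' "$1"
}

# Example: guess_seq_type qy.fa
```

If the guess disagrees with what you know about the data, that is a good sign that --querytype or --targettype should be set explicitly.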
Using exonerate on a compute farm
Picking output formats
- By default, exonerate will report alignments
in a human-readable format and also in 'vulgar' format.
You can turn these outputs off like this:
exonerate --showvulgar no --showalignment no ...
- You can also specify other output formats,
such as 'cigar' lines,
or GFF output on the query or target sequences:
exonerate --showcigar yes --showquerygff yes
- In addition to these output formats, it is possible to generate
output in a format of your choosing by using the --ryo option.
- The --ryo (roll your own) option allows you
to print various fields using a printf-like syntax.
- For example, if you wished to dump the portion of the
target sequence which appears in the alignment in FASTA format,
you could use an --ryo line like this:
exonerate --ryo ">%ti (%tab - %tae)\n%tas\n"
... to produce output such as this:
>hshcf2.embl (13651 - 13802)
- Using this allows you to generate easy-to-parse
output tailored to your needs.
A full list of the fields which can be used is in the
exonerate man page.
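As a sketch of how such output might be consumed, one option is to tag each --ryo line with a fixed word so that it can be separated from anything else exonerate prints. The example assumes an --ryo string of "RYO %qi %ti %tab %tae %s\n", where %qi (query id) and %s (raw score) are fields taken from the man page list rather than from this guide. The "RYO" tag and the function name filter_ryo are hypothetical choices made for illustration.

```shell
# Read exonerate output on stdin, keep only lines tagged "RYO" with the
# expected six fields, drop hits scoring below a minimum, and print
# query id, target id, target begin, target end and score, tab-separated.
filter_ryo() {
    min_score=${1:-0}
    awk -v min="$min_score" '
        $1 == "RYO" && NF == 6 && $6 + 0 >= min + 0 {
            printf "%s\t%s\t%s\t%s\t%s\n", $2, $3, $4, $5, $6
        }
    '
}

# Example (assumed invocation, combining options shown in this guide):
#   exonerate --showvulgar no --showalignment no \
#       --ryo "RYO %qi %ti %tab %tae %s\n" query.fasta target.fasta \
#       | filter_ryo 500
```

Space-separated fields are used here because FASTA identifiers do not contain spaces; a different separator would need a matching change in the awk field handling.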
Applying score thresholds
- There are several ways to apply score thresholds in exonerate.
- Applying a sensible score threshold not only reduces the
number of spurious alignments, but will also make the searches
run faster due to the way that BSDP works.
- --score : a simple score threshold
- --percent : report only alignments scoring over a given percentage
of the maximum score attainable by each query
- --bestn : report the best N matches for each query.
exonerate --score 500
- more to follow ...
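To make the meaning of "best N matches for each query" concrete, the selection can be imitated by post-processing a list of hits. This is only an illustration of the idea: the function name best_n_per_query and the three-column "query target score" input are hypothetical, and filtering after the search gives none of the speed benefit that comes from passing a threshold to exonerate itself.

```shell
# Read lines of "query target score" on stdin and print the N
# highest-scoring lines for each query (N defaults to 1).
best_n_per_query() {
    n=${1:-1}
    # Group by query, highest score first, then keep the first N per query.
    sort -k1,1 -k3,3nr | awk -v n="$n" '{ if (++count[$1] <= n) print }'
}

# Example: best_n_per_query 2 < hits.txt
```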
Increasing search sensitivity
Making quick and dirty alignments
exonerate -Q dna -T dna -m ungapped -w 14 --fsmmemory 512 \
--dnawordthreshold 0 --dnahspthreshold 140
Also, using the --bigseq option
... more to follow