|me-PCR - multithreaded electronic PCR: find STS's in FASTA sequence|
me-PCR - multithreaded electronic PCR: find STS's in FASTA sequence
me-PCR [options] sts_file fasta_file >output
OPTIONS: M=# Margin (default 50) N=# Number of mismatches allowed (default 0) W=# Word size (default 11) T=# Number of threads (default 1) X=# Number of 3'-ward bases in which to disallow mismatches (default 0) O=file Output file name (default stdout) Q=# Quiet flag 0 = verbose progress messages 1 = no progress messages (default) S=# Max. line length for the STS file (not counting line terminators) (default 1022) Z=# Default PCR size (default 240) I=# IUPAC flag 0 = don't honor IUPAC ambiguity symbols in STS's (default) 1 = honor IUPAC ambiguity symbols in STS's
me-PCR searches large sequences for Sequence-Tagged Site (STS) markers, or any sequence fragment that can be defined in terms of two short subsequences separated by an approximate distance. me-PCR was originally an enhanced version of Dr. Gregory Schuler's e-PCR program (see the HISTORY section at the bottom of this document). However, me-PCR has not been maintained for several years.
In general, you should e-PCR instead of me-PCR! e-PCR has improved greatly over the years.
me-PCR's performance and results are greatly affected by the M (margin), N (mismatches), and W (word size) parameters. See the following list for tips.
me-PCR can do ultra-fast matches of W adjacent bases (called the ``hash word'') in a primer. The price of this speed is that mismatches, if any are allowed by the N parameter, are not allowed to occur inside a hash word. If the left primer is found, a slower search determines if the rest of the left primer is present in the right spot. If the entire left primer is found, a much slower search determines if the right primer is in the range of locations determined by a) the stated PCR product length; b) the margin M; and c) for STS's with a size of 0, the default product length of 240. (The Z parameter can change this default). If the stated or default product length is denoted L, then the right primer is searched for in those positions such that the STS length would be greater than or equal to L - M and less than or equal to L + M. STS's are searched for in both orientations.
Figure 1. How me-PCR views a search once the left primer is found. PCR product size | M (margin) |---------------+---------------| | | |-------+-----| A B C D -//--[nnnn#####n]-----(<-------------+--[nnnnn]---]>)----//--
|--+-------| |-----| | | / |-+-| "Right" primer | \ | | "Left" Hash |-------------+-------------| primer word | (W) Possible location of right primer
A = sequence position of left primer B = leftmost possible position of left side of right primer C = expected position of right side of right primer D = rightmost possible position of right side of right primer
For each primer of an STS, a ``hash word'' is computed, which is essentially W adjacent bases somewhere inside the primer compressed into a 32-bit integer, where each of the four bases occupies two binary bits. Unlike original e-PCR, the hash word can occur anywhere, not just at the end. Both primers are used because me-PCR searches for the STS in both forward and reverse orientations. In Figure 1, the bases contributing to the hash word are denoted by pound signs. A primer can not be hashed if the W value is larger than the primer is long or if ambiguity symbols in the primer prevent the occurrence of W adjacent bases. If a primer can't be hashed, a warning is displayed, and the STS will not be searched for in the corresponding orientation.
Pointers to the STS's are placed in a hash table, which is simply an array of X pointers, where X is the number of possible hash words (2 raised to the 2*W power) (For large values of W, this table becomes very large; at W=11, the usage is 16 MB.) Note that it is perfectly possible for primers to share the same hash value (and hence hash table entry), in which case they are said to collide. Such colliding primers are dealt with by a linked list extending from the hash table entry.
When searching, me-PCR slides a W-sized window across the underlying sequence. For each W-sized window, me-PCR computes the corresponding hash value and determines which primers have a matching hash word. This determination is instantaneously provided by a simple array index operation. In the case that multiple primers match because they share the same hash word, me-PCR looks at each in turn by following a linked list from the main hash table entry. If the hash value of a primer matches, a direct string comparison is made against the entire primer. If this comparison succeeds, then direct string comparisons of the second primer against the underlying sequence are made. The second primer has an expected offset relative to the first primer (implied by a known PCR amplicon size), but the margin parameter M allows the position of the second primer to vary relative to the first, at the computational cost of 2*M additional string comparisons.
The sensitivity of a search is primarily dictated by two parameters: the number of potential mismatches N in the STS other than the hash word, and the margin M. As stringency is decreased (i.e. N or M is increased), the likelihood of a match increases. However, a trade-off exists between stringency and the probability of a false hit.
e-PCR is much improved and in general should be used instead of me-PCR.
0 = don't honor IUPAC ambiguity symbols in STS's (default) 1 = honor IUPAC ambiguity symbols in STS's
The new optional I (IUPAC) parameter allows me-PCR to correctly interpret ambiguity symbols in STS's. The IUPAC-based comparison is performed only for primers containing ambiguity symbols; a faster comparison is always performed for unambiguous primers.
The original me-PCR program does not honor ambiguous base symbols such as 'N','W', etc, during direct string comparisons. If the N parameter is 0 in original me-PCR, the effective result is that no STS containing an ambiguous base symbol will ever be localized. Likewise, if the N parameter is 0, no STS will ever be localized in a region of the underlying sequence which itself contains ambiguous base characters.
me-PCR does not honor IUPAC symbols in the underlying sequence, e.g. an 'A' in an STS will never match an 'N' in the underlying sequence. This is not an unreasonable limitation, because the vast majority of ambiguity symbols in chromosome assemblies are merely placeholders for gaps, taking the form of large blocks of 'N' symbols.
me-PCR allows an M value as low as 0, which, while not generally useful, is the most stringent search condition possible.
For ranged PCR sizes (product lengths), the margin value is applied to range, so that, if the range is 100-150 and M is 50, the effective range is 50-200.
0 = verbose progress messages 1 = no progress messages (default)
Note that the performance increase is far from linear; the rate of increase diminishes with each added thread. This effect is typical of SMP computers but varies somewhat depending on operating system and hardware configuration. Also note that performance increases slightly as the number of threads is increased above the number of CPU's. This effect can be used as a ``poor man's priority booster'', but it might be a better idea to adjust process priorities using the official mechanisms appropriate for the operating system.
If you are searching for STS's across multiple sequences, multithreading is not strictly necessary in order to take full advantage of multi-processor computers. As an example, one can simply start 8 simultaneous instances of a single-threaded program on an 8-processor computer, and the operating system will distribute the program processes across all processors. The obvious way to do this is to use a shell script to start 8 instances of a program in the background. If the number of jobs is larger than the number of processors, it is hard to control them properly with a simple shell script. The danger is that too many processes will run at the same time, causing CPU contention and quite likely disk thrashing, if each program uses a lot of memory (e.g. me-PCR operating on chromosome-length sequence). This problem can be solved in several ways: patiently running programs sequentially on one CPU (;-); writing a custom launcher script in Perl to run the jobs; or use GNU make. With a parallel make utility (such as GNU make with its 'j' option), many separate runs can be performed, all driven by a single makefile, and the make utility will ensure that exactly 8 processes are running at any given time. However, the multithreading feature of me-PCR is much easier to use, and just as fast!
me-PCR automatically uses just a single thread when processing sequences less than a certain amount, currently 100KB.
me-PCR is tuned for processing very large sequences; in fact, me-PCR is slower than the latest version of e-PCR (using W = 8) when operating on sequences less than ca. 3 MB, because of the overhead of creating an extra thread. This flaw will be fixed in the future.
The only disadvantages of a larger word size are 1) a larger memory requirement; 2) a possible, small increase in the number of STS's that won't be searched for; and 3) a possible, small increase in the number of STS's that can't be found when using N > 0 if one of the bases in the hash word is 'N' or wrong.
The second disadvantage affects a tiny percentage of STS's and applies in only two cases: 1) when the number of mismatches allowed (N) is nonzero; or 2) in the case of me-PCR with the IUPAC option enabled, even when N is 0. The issue is that me-PCR cannot build a W-length hash word for a primer unless there are W consecutive unambiguous bases in the primer. In the case of original me-PCR, the limitation is more severe in that the hash word can only occur at the end of the primer. For instance, at a word size of 7, original me-PCR is unable to search for 128 STS's out of the test UniSTS set, but me-PCR is unable to search for only 1, thanks to its variable offset hash feature. There is some biological justification for the limitation in NCBI e-PCR, but the justification becomes rapidly less relevant as the word size is increased.
If you want to use me-PCR to search for STS's containing ambiguous bases, there are two options. First, you can run me-PCR with the mismatch parameter N greater than 0. Doing so will reduce the stringency unnecessarily to treat a very special case. The preferable alternative is to run me-PCR with the mismatch parameter N set to 0 and the IUPAC parameter set to 1. This will retain stringency for all STS's except those with ambiguous bases, and will only allow IUPAC-determined mismatches in the latter.
A word size of around 11 seemed to produce optimal results. The optimal word size depends on platform and available memory, but word sizes of 10-12 are generally best. However, if too large a word size is specified, such that the memory demands of the hash table exceed physical memory, performance plummets due to page swapping.
The following table shows the relationship of word size W to hash table memory usage:
W Memory Time -- -------- ---- 1 0 MB NA 2 0 MB NA 3 0 MB NA 4 0 MB NA 5 0 MB 4238 6 0 MB 1143 7 .1 MB 315 8 .2 MB 96 9 1 MB 37 10 4 MB 23 11 16 MB 20 12 64 MB 21 13 256 MB 25 14 1,024 MB 2232 15 4,096 MB NA 16 16,384 MB NA
The time column shows the time to run ca. 80,000 STS's against chromosome 19 on an 867 MHz G4 OS X computer with 1.1 GB RAM. Cells marker 'NA' were not tested.
In one database of 70,498 human STS's, the average STS length is 165, the median is 155, and the percent of lengths between 190 and 290 (the range implied by the default margin (M) value of 50) is 24%. 8.1% of our test STS set has zero length. Given a similar dataset, researchers may wish to tune the default STS length using the Z option on the me-PCR command line. Use of a generous margin value M is also recommended. A Perl script for evaluating an STS set is provided in the me-PCR distribution. Alternatively, and possibly as a matter of course, users may wish to separate STS's with 0 length and run them with a large M parameter value.
0 if a search can be made; nonzero otherwise.
Use the V=1 switch to turn on verbose messages. These are written to STDERR; you may want to redirect to a file.
The STS input file should have the following format:
Field 1: Unique Id Field 2: Primer1 Field 3: Primer2 Field 4: PCR product size in bp Fields 5+: optional
This format is compatible with the UniSTS data file format (e.g. UniSTS_human.sts).
Lines are terminated by linefeeds (ASCII 10), the convention for UNIX text files.
The product size may be a range of numbers separated by a dash.
The unique ID is important for identifying output lines but is otherwise ignored.
Primers should not use notation such as '[A/T]'.
The FASTA input file should be a UNIX-style text file with the following format:
>label nnnnnnnnnn... ...
There can now be multiple sequences in this file. However, me-PCR does not yet assign individual sequences to separate threads. Therefore, for high-thoughput processing of many small sequences, it would currently be better to process them with separate invocations of me-PCR, controlled by a shell script or makefile.
me-PCR's output is tested against NCBI's e-PCR versions and double-checked with custom Perl scripts, such as the provided find_sts.pl program, to ensure no false positives. False negatives are eliminated using a custom test case generator that methodically generates thousands of combinations of STS and FASTA files. These test cases were designed to exercise boundary conditions within me-PCR and reduce the chance of bugs. Memory handling is checked using dmalloc. Multithreaded operation is tested and verified on Sun Solaris, IBM AIX, Apple OS X, SuSE Linux, and Microsoft XP platforms.
Original author: G. D. Schuler, NCBI
Current tweaker: Kevin Murphy, Children's Hospital of Philadelphia <email@example.com>
2008-02-18 1.0.6: Small sequences no longer cause aborts for T>1. 2004-04-13 1.0.5c: Multiple sequences in FASTA files accepted. 2004-04-05 1.0.5b: X parameter added. Tested on AIX.
me-PCR was developed by Kevin Murphy in Dr. Peter White's lab at Children's Hospital of Philadelphia to be an enhanced version of G. D. Schuler's e-PCR program, used and distributed by NCBI. Dr. Schuler published a paper about e-PCR: ``Sequence mapping by Electronic PCR'', Genome Research 7: 541-550, 1997. Thanks to Dr. Schuler for making his source code available.
|me-PCR - multithreaded electronic PCR: find STS's in FASTA sequence|