REPEATMASKER DATABASES

These are the repeat libraries for the program RepeatMasker. To install them, move or copy both files, "RepeatMasker.lib" and "version", into the "Libraries" subdirectory of the RepeatMasker directory.

The RepeatMasker program, maintained by Arian Smit and Robert Hubley and copyrighted at the Institute for Systems Biology, is distributed without its repeat consensus sequence file. These sequence data are copyrighted by the Genetic Information Research Institute and are available at http://www.girinst.org/server/RepBase/RepBase.rptmsk.tar.gz.

Updates of the databases may not always coincide with software updates. Check this site or our server (www.repeatmasker.org) to see whether a newer database version is available.
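The installation step can also be scripted. The following is a minimal Python sketch, not part of the RepeatMasker distribution: the function name install_libraries and both directory arguments are hypothetical, and it assumes the downloaded archive has already been unpacked so that "RepeatMasker.lib" and "version" sit together in one directory.

```python
import shutil
from pathlib import Path

# The two files named in this README; both must be installed together.
REQUIRED_FILES = ("RepeatMasker.lib", "version")


def install_libraries(download_dir, repeatmasker_dir):
    """Copy the library files into <repeatmasker_dir>/Libraries.

    Hypothetical helper: it only copies files and does not run RepeatMasker.
    """
    src = Path(download_dir)
    dest = Path(repeatmasker_dir) / "Libraries"

    # Refuse to guess: the Libraries subdirectory should already exist.
    if not dest.is_dir():
        raise FileNotFoundError(f"no Libraries subdirectory at {dest}")

    # Check for both files before copying, so a partial install never happens.
    missing = [name for name in REQUIRED_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src}: {', '.join(missing)}")

    return [Path(shutil.copy2(src / name, dest / name)) for name in REQUIRED_FILES]
```

Checking for both files before copying anything keeps the library and its "version" file from getting out of step.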

INCOMPATIBILITY WITH OLDER VERSIONS OF REPEATMASKER

Whereas previous versions used multiple files with repeat consensus sequences, RepeatMasker-open-3.1.2 of Oct 2005 and later versions use a single EMBL file combining all types of repeats found in all species. The program creates a species-specific set of repeat files the first time a species is used. This new format will be maintained for the foreseeable future. This (necessary) change in formatting unfortunately creates a backward incompatibility: older versions of the program do not work with the new library file, and vice versa. To download the latest version of RepeatMasker, please visit http://www.repeatmasker.org.
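To get a quick overview of what the combined file contains, the entries can be enumerated with a few lines of Python. This is a sketch that assumes only the general EMBL flat-file conventions (each entry begins with an "ID" line and ends with a line containing "//"). The function names are hypothetical, and this is not how RepeatMasker itself reads the library.

```python
def iter_entry_ids(lines):
    """Yield the identifier (first field after 'ID') of each EMBL entry."""
    for line in lines:
        if line.startswith("ID "):
            fields = line.split()
            if len(fields) > 1:
                # Some EMBL variants end the identifier with a semicolon.
                yield fields[1].rstrip(";")


def count_entries(lines):
    """Count entries by their '//' terminator lines."""
    return sum(1 for line in lines if line.strip() == "//")
```

Both functions accept any iterable of lines, so an open file handle on "RepeatMasker.lib" can be passed directly without loading the whole file into memory.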

SPECIES COVERAGE

The util directory of the RepeatMasker software package contains the script queryRepeatDatabase.pl, which prints or lists all repeats included in an analysis for a given species. If your query species is not covered, or if you have a larger set of repeats available, you can create your own libraries and use them with RepeatMasker through the -lib option.
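As an illustration of the -lib option, the sketch below assembles a RepeatMasker command line for a custom library in Python. The function name and the file names in the usage note are hypothetical, the executable is assumed to be on the PATH under the name "RepeatMasker", and only the -lib option mentioned above is used.

```python
from pathlib import Path


def build_custom_lib_command(library, sequence_file, executable="RepeatMasker"):
    """Return the argument list for masking sequence_file with a custom library.

    Hypothetical helper: it only builds the command and does not run it.
    """
    library = Path(library)
    # Fail early if the custom library is absent.
    if not library.is_file():
        raise FileNotFoundError(f"custom library not found: {library}")
    return [executable, "-lib", str(library), str(sequence_file)]
```

The returned list can be passed to subprocess.run, for example subprocess.run(build_custom_lib_command("my_repeats.lib", "genome.fa"), check=True), where both file names are placeholders.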

RELATIONSHIP WITH REPBASE UPDATE

We maintain these libraries as co-editor of Repbase Update and try to keep them in sync with the RepBase Update libraries. However, at any one time there are differences. Entries can differ somewhat in sequence, generally by no more than a few percent. This happens because independently derived consensus sequences occasionally persist in one database or the other, or because updates to consensus sequences do not reach RepBase immediately.

The nomenclature is by and large identical, but we are aware of discrepancies and are working to eliminate them. One unavoidable source of these differences is RepeatMasker's extensive post-alignment processing (that is, improvement) of the repeat annotation. To give one of many examples, internal sequences of LTR elements can be named after the flanking LTRs, even if there is no specific entry for that element in the databases.

Quite a few entries in these libraries are not yet in the EMBL formatted RepBase Update (RU) because we have not yet formally submitted them. Others are missing from RU because it does not include all known subfamilies. On the other hand, a few RU entries may be missing from the RepeatMasker libraries because our releases lag behind the RU releases and are spaced further apart. In addition, we do extra curation and exclude RU entries that give rise to false positives, would mask genes, or do not appear to be repetitive after all.

Arian Smit PhD
Institute for Systems Biology
Seattle, WA
asmit@systemsbiology.org

RepeatMasker software and database development and maintenance are currently funded by an NIH/NHGRI R01 grant HG02939-01 to Arian Smit. RepBase Update development and maintenance are funded by NIH/NLM grant No.2P41LM006252-07A1 to Jerzy Jurka.