README for REPCLASS
-------------------
**User configuration guide**
-------------------
(last update 06/09/09)
version 1.0.0

Feschotte Lab - University of Texas at Arlington

Contact information for FAQs and bugs:
Cedric Feschotte
Email: cedric@uta.edu
Office: 817-272-2426
Lab: 817-272-5574
Fax: 817-272-2855

Thanks:
Umeshkumar Keswani
Nirmal Ranganathan
Marcel L. Guibotsy Mboulas
Assiatou Barrie
David Levine
-------------------

**User Configuration Guide**

This guide explains where to set the variables that must be configured for each run, where to find the outputs, and how to read them.
Please read the entire 'User Configuration Guide' before proceeding, and make sure you understand each step required for a successful configuration.

1. Copy the 'sample.conf' file from the 'REPCLASS/conf' folder to the folder where the target genome is located and rename it appropriately.
   e.g. use the command:
    > cp /home/software/REPCLASS/conf/sample.conf /home/your_name/genomes/repclass_configure.conf

2. Open this 'repclass_configure.conf' file and make the following modifications:

2.1. Set the '$JOB_NAME' entry to an appropriate job name (e.g. name of the species and assembly, or name of the organism):
    e.g. $JOB_NAME = "hg18";

2.2. Set the '$DATA' entry to the complete path of the folder where you want the REPCLASS output to be written.
    e.g. $DATA = "/home/your_name/REPCLASSoutputs/human_genome";

2.3. Set the '$GENOME_LOCATION' entry to the complete path of the folder that contains the target genome fasta file (generally the folder where all the genomes are located).
    e.g. $GENOME_LOCATION = "/home/your_name/genomes";

2.4. Set the '$GENOME_FILE' entry to the actual name of the input target genome fasta file.
    e.g. $GENOME_FILE = "hg18_all";

2.5. Set the '$TE_SEQUENCE' entry to the complete path of the file that contains the consensus sequences you want to classify.
    e.g. $TE_SEQUENCE = "/home/your_name/ConsensusLibrary/human_repeats_cons_seq.txt";

3. Save the modifications and close the 'repclass_configure.conf' file.

4. Go to the 'REPCLASS/bin' folder and open the file 'runrc'. Replace 'ArgumentPath' with the complete path of the 'repclass_configure.conf' file.
   e.g.:
    > perl /home/software/REPCLASS/bin/rc.pl ArgumentPath;
   would become:
    > perl /home/software/REPCLASS/bin/rc.pl /home/your_name/genomes/repclass_configure.conf;
   The 'ArgumentPath' (the 'repclass_configure.conf' file) is passed on to the subsequent perl scripts so that the package variables are loaded properly.

5. RUNNING REPCLASS

Finally, execute the 'runrc' script as follows (from the same folder as in step 4):
    ./runrc

Note: 'runrc' calls another program, 'rc.pl', which in turn calls 5 programs:

5.1. 'xdformat', from the 'wublast' package - creates a blastable database from the genome file, so that the consensus sequence library can be blasted against this database.

5.2. 'fasta_index.pl' - generates an index file for the target genome fasta file, for faster execution.

5.3. 'repclass.pl' - the REPCLASS perl script that creates a directory for each output, separates the individual consensus sequences, and passes them one by one to 'repmain.pl' as inputs through successive calls to that script.

5.4. 'repmain.pl' - classifies each sequence using the 3 classification methods, namely Homology (HOM), Structural (STR) and Target Site Duplication (TSD). Of these methods, Homology is the most reliable, followed by Structural.

5.5. 'result.pl' - collates the results obtained for every sequence with the three classification methods mentioned above.

For each sequence in the output, the user sees three classifications, one per method. If the classifications conflict, we recommend using the reliability ranking mentioned above (Homology > Structural > TSD) to resolve the conflict.
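The reliability ranking above (Homology > Structural > TSD) can be applied by hand, but a small helper makes the rule explicit. The following is only a minimal sketch: 'pick_classification' is a hypothetical function that is not part of the REPCLASS package, and it assumes that a method which produced no classification is represented by an empty string.

```shell
#!/bin/sh
# Hypothetical helper (NOT part of REPCLASS): choose one classification from
# the three method results, following the ranking Homology > Structural > TSD.
# Assumption: an empty argument means that method gave no classification.
pick_classification() {
  pc_hom=$1   # result of the Homology (HOM) method
  pc_str=$2   # result of the Structural (STR) method
  pc_tsd=$3   # result of the Target Site Duplication (TSD) method

  if [ -n "$pc_hom" ]; then
    printf '%s\n' "$pc_hom"
  elif [ -n "$pc_str" ]; then
    printf '%s\n' "$pc_str"
  elif [ -n "$pc_tsd" ]; then
    printf '%s\n' "$pc_tsd"
  else
    # No method produced a classification: print nothing and signal failure.
    return 1
  fi
}

# Example: Homology gave nothing, Structural says LTR, TSD says DNA.
pick_classification "" "LTR" "DNA"   # prints: LTR
```

The helper only encodes the order of preference; it does not try to judge how strongly the methods disagree, so conflicting results are still worth checking by eye.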
Because the three REPCLASS modules can be run independently, you will be prompted, when the program starts, to choose among seven options:

    1 Homology (HOM)
    2 Target Site Duplication (TSD)
    3 Structural (STR)
    4 HOM and TSD
    5 HOM and STR
    6 TSD and STR
    7 HOM, TSD and STR

When prompted, type the number that corresponds to your need, then confirm your answer (hit 'Enter' on the keyboard).

'runrc', 'rc.pl', 'fasta_index.pl', 'repclass.pl', 'repmain.pl', and 'result.pl' are all located in the 'REPCLASS/bin' folder of the REPCLASS package.

6. HOW TO ACCESS AND READ THE OUTPUTS

6.1. ACCESSING OUTPUTS

REPCLASS delivers the results of the analyses in the '$DATA' folder (see 2.2. above).
In '$DATA', you will find a folder named after the target genome file. This folder contains a file ("final.out.txt") and a folder named after the '$JOB_NAME' entry (see 2.1. above).
For a summary of the final results, open the "final.out.txt" file.
The folder named after the '$JOB_NAME' entry contains 10 folders (whose names are self-explanatory) and the "repclass.log" file, which records information about events such as failures during the process.

6.2. READING OUTPUTS

The following notation is used in the "final.out.txt" file:

C()   the classification under CLASS (i for Retrotransposons and ii for Transposons).
SC()  the classification under SUB-CLASS (DNA under Transposon; LTR & Non-LTR under Retrotransposon).
SF()  the classification under SUPER-FAMILY. Refer to the manuscript for details.

H->   the classification obtained using the Homology method.
S->   the classification obtained using the Structural method. S_LTR-> means the LTR script was used, S_SSR-> the Short Sequence Repeat script (refer to the manuscript for details).
T->   the classification obtained using the Target Site Duplication (TSD) method.
Each of the three methods (H, S, and T) can give all three levels of classification (C, SC, and SF).
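To locate the files described in 6.1., the expected paths can be built from the configuration values. The sketch below is an illustration only: 'repclass_outputs' is a hypothetical helper that is not part of REPCLASS, and it assumes that the folder "named after the target genome file" carries exactly the '$GENOME_FILE' value, which you should verify against your own '$DATA' folder.

```shell
#!/bin/sh
# Hypothetical helper (NOT part of REPCLASS): print the expected locations of
# the summary file and the log file, given the values of $DATA, $GENOME_FILE
# and $JOB_NAME from 'repclass_configure.conf'.
# Assumption: the genome folder inside $DATA is named exactly like $GENOME_FILE.
repclass_outputs() {
  ro_data=$1     # value of $DATA
  ro_genome=$2   # value of $GENOME_FILE
  ro_job=$3      # value of $JOB_NAME

  printf '%s\n' "$ro_data/$ro_genome/final.out.txt"
  printf '%s\n' "$ro_data/$ro_genome/$ro_job/repclass.log"
}

# Example with the values used in section 2 of this guide:
# report whether each expected file is present.
repclass_outputs "/home/your_name/REPCLASSoutputs/human_genome" "hg18_all" "hg18" |
while IFS= read -r path; do
  if [ -e "$path" ]; then
    echo "found:   $path"
  else
    echo "missing: $path"
  fi
done
```

If "repclass.log" is reported as found, it is the first place to look when "final.out.txt" is missing or incomplete, since it records failures during the run.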