README for REPCLASS
-------------------
**User configuration guide**
-------------------
(last update 06/09/09)

version 1.0.0

Feschotte Lab - University of Texas at Arlington

Contact information for FAQs and/or bugs:
Cedric Feschotte
Email: cedric@uta.edu
Office: 817-272-2426
Lab: 817-272-5574
Fax: 817-272-2855

Thanks:
Umeshkumar Keswani
Nirmal Ranganathan
Marcel L. Guibotsy Mboulas
Assiatou Barrie
David Levine

-------------------

**User Configuration Guide**

This guide explains where to set the variables that must be configured
for each run, and where to find and how to read the outputs.

Please read the entire 'User Configuration Guide' before proceeding, and
make sure you understand each step required for a successful configuration.

1. Copy the 'sample.conf' file from the 'REPCLASS/conf' folder to the
folder where the target genome is located and rename it appropriately,
e.g. with the command:

> cp /home/software/REPCLASS/conf/sample.conf /home/your_name/genomes/repclass_configure.conf

2. Open this 'repclass_configure.conf' and make the following
modifications:

2.1. Configure the '$JOB_NAME' entry to an appropriate job name
(e.g. name of the species and assembly, or name of organism):
e.g. $JOB_NAME = "hg18";

2.2. Configure the '$DATA' entry to the complete location of the
folder where you want the repclass output to be located.
e.g. $DATA = "/home/your_name/REPCLASSoutputs/human_genome";

2.3. Configure the '$GENOME_LOCATION' entry to the complete
location of the folder that contains the target genome fasta file
(generally where all the genomes are located).
e.g. $GENOME_LOCATION = "/home/your_name/genomes";

2.4. '$GENOME_FILE' refers to the actual name of the input target
genome fasta file.
e.g. $GENOME_FILE = "hg18_all";

2.5. '$TE_SEQUENCE' entry refers to the complete location of the
file which contains the consensus sequences that you want to
classify.
e.g. $TE_SEQUENCE = "/home/your_name/ConsensusLibrary/human_repeats_cons_seq.txt";
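
Before moving on, it can help to check that none of the five entries above
was left out or left empty. The following is a minimal Python sketch of such
a check, not part of the REPCLASS package. It assumes each entry is written
on one line as $NAME = "value"; (as in the examples above); the helper names
'read_entries' and 'missing_entries' are hypothetical, and the real
'sample.conf' may contain additional entries that this sketch ignores.

```python
import re

# The five entries described in steps 2.1 to 2.5.
REQUIRED = ("JOB_NAME", "DATA", "GENOME_LOCATION", "GENOME_FILE", "TE_SEQUENCE")

# Assumption: one assignment per line, written as $NAME = "value";
ENTRY = re.compile(r'^\s*\$(\w+)\s*=\s*"([^"]*)"\s*;', re.MULTILINE)


def read_entries(conf_text):
    """Return a dict mapping entry names to their quoted values."""
    return {name: value for name, value in ENTRY.findall(conf_text)}


def missing_entries(conf_text):
    """Return the required entries that are absent or set to an empty string."""
    entries = read_entries(conf_text)
    return [name for name in REQUIRED if not entries.get(name)]


# Example text built from the values used in this guide.
EXAMPLE_CONF = '''
$JOB_NAME = "hg18";
$DATA = "/home/your_name/REPCLASSoutputs/human_genome";
$GENOME_LOCATION = "/home/your_name/genomes";
$GENOME_FILE = "hg18_all";
$TE_SEQUENCE = "/home/your_name/ConsensusLibrary/human_repeats_cons_seq.txt";
'''

if __name__ == "__main__":
    print(missing_entries(EXAMPLE_CONF))
```

An empty list means all five entries were found. To check a real file, read
its contents into a string and pass that string to 'missing_entries'.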

3. Save the modifications and close the 'repclass_configure.conf' file.

4. Go to the 'REPCLASS/bin' folder and open the file 'runrc'. Replace
'ArgumentPath' by the complete location of the file
'repclass_configure.conf', e.g.:
> perl /home/software/REPCLASS/bin/rc.pl ArgumentPath;
would become:
> perl /home/software/REPCLASS/bin/rc.pl /home/your_name/genomes/repclass_configure.conf;

The 'ArgumentPath' (the 'repclass_configure.conf' file) is passed into
the subsequent perl scripts to ensure the proper call of package
variables.

5. RUNNING REPCLASS

Finally, execute the 'runrc' script as follows (from the same folder
as in step 4):
./runrc

Note:
'runrc' will call another program: 'rc.pl'

'rc.pl' will call 5 programs:

5.1. 'xdformat', from the 'wublast' package - creates a
blastable database from the genome file, so that the consensus sequence
library can be blasted against this database.

5.2. 'fasta_index.pl' - generates an index file for the target genome
fasta file, for faster execution.

5.3. 'repclass.pl' - repclass perl script which opens
directories for each output, separates the individual consensus sequences,
and passes them one by one to 'repmain.pl' as inputs through successive
calls to that script.

5.4. 'repmain.pl' - classifies each sequence using the 3
different classification methods, namely Homology (HOM), Structural (STR)
and TSD. Of these methods, Homology is the most reliable, followed by
Structural.

5.5. 'result.pl' - collates the results obtained for
every sequence using the three classification methods mentioned above.
For each sequence in the output there are three classifications for the
user to see. In case of conflicting classifications, we recommend
using the reliability ranking mentioned above (Homology > Structural > TSD)
to resolve the conflict.
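
The recommended ranking can be applied mechanically when the three methods
disagree. The following is a minimal Python sketch of that rule, not code
from the REPCLASS package. The function name 'preferred_classification' and
the dict layout are hypothetical, and the labels in the usage example are
placeholders rather than real REPCLASS output.

```python
# Reliability ranking recommended in this guide: Homology > Structural > TSD.
RELIABILITY_ORDER = ("HOM", "STR", "TSD")


def preferred_classification(by_method):
    """Pick the classification from the most reliable method that gave one.

    by_method maps a method name ("HOM", "STR", "TSD") to the label that
    method produced, or to None or "" when it produced nothing.
    Returns a (method, label) tuple, or None if no method gave a label.
    """
    for method in RELIABILITY_ORDER:
        label = by_method.get(method)
        if label:
            return method, label
    return None


if __name__ == "__main__":
    # Placeholder labels: Structural and TSD disagree, Homology gave nothing.
    print(preferred_classification({"HOM": None, "STR": "LTR", "TSD": "DNA"}))
```

The rule only selects among the classifications already reported; it does
not replace inspecting the three results when they conflict.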

Because the three REPCLASS modules can be run independently, you
will be prompted to choose among seven options when you start running
the program:

1 Homology (HOM)

2 Target Site Duplication (TSD)

3 Structural (STR)

4 HOM and TSD

5 HOM and STR

6 TSD and STR

7 HOM, TSD and STR

When prompted, type the number that corresponds to your need,
then confirm your answer by pressing 'Enter'.
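
For reference, the seven options above amount to a lookup from the number
typed at the prompt to a combination of modules. The following Python sketch
only restates the menu as a table; it is not how 'rc.pl' handles the prompt,
and the names 'MENU' and 'modules_for' are hypothetical.

```python
# The seven menu options listed above, as a lookup table.
MENU = {
    1: ("HOM",),
    2: ("TSD",),
    3: ("STR",),
    4: ("HOM", "TSD"),
    5: ("HOM", "STR"),
    6: ("TSD", "STR"),
    7: ("HOM", "TSD", "STR"),
}


def modules_for(choice):
    """Return the modules selected by a menu choice (an int or a digit string)."""
    try:
        return MENU[int(choice)]
    except (KeyError, ValueError, TypeError):
        raise ValueError("choice must be a number from 1 to 7") from None


if __name__ == "__main__":
    print(modules_for("5"))
```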

'runrc', 'rc.pl', 'fasta_index.pl', 'repclass.pl', 'repmain.pl', and
'result.pl' are all located in the 'REPCLASS/bin' folder of the REPCLASS
package.

6. HOW TO ACCESS AND READ THE OUTPUTS

6.1. ACCESSING OUTPUTS

REPCLASS delivers the results of the analyses in the '$DATA' folder (see
2.2. above). In '$DATA', you will find a folder named after the target
genome file, which in turn contains a file ("final.out.txt") and a folder
named after the "$JOB_NAME" entry (see 2.1.). For a summary of the final
results, open the "final.out.txt" file. The folder named after the
"$JOB_NAME" entry contains 10 folders (whose names are self-explanatory)
and the "repclass.log" file, which records information about events such
as failures during the process.
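
The layout described above can be turned into concrete paths. The following
is a minimal Python sketch, not part of the REPCLASS package. It assumes that
the folder "named after the target genome file" carries exactly the
'$GENOME_FILE' value; if your run names it differently, adjust accordingly.
The function name 'output_paths' and the dict keys are hypothetical.

```python
from pathlib import PurePosixPath


def output_paths(data, genome_file, job_name):
    """Build the expected output locations from the configuration values.

    Assumption: the genome folder inside $DATA is named exactly like
    the $GENOME_FILE entry.
    """
    genome_dir = PurePosixPath(data) / genome_file
    job_dir = genome_dir / job_name
    return {
        "summary": genome_dir / "final.out.txt",  # summary of the final results
        "job_dir": job_dir,                       # holds the per-output folders
        "log": job_dir / "repclass.log",          # records events such as failures
    }


if __name__ == "__main__":
    paths = output_paths(
        "/home/your_name/REPCLASSoutputs/human_genome", "hg18_all", "hg18"
    )
    for name, path in paths.items():
        print(name, path)
```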

6.2. READING OUTPUTS

The following key explains the notation used in the "final.out.txt" file:

C() means this is the classification under CLASS (i for
Retrotransposons and ii for Transposons).
SC() means this is the classification under SUB-CLASS (DNA
under Transposon; LTR & Non-LTR under Retrotransposon)
SF() means this is the classification under SUPER-FAMILY. Refer to
manuscript for details.
H-> means this is the classification obtained using the Homology method.
S-> means this is the classification obtained using the Structural
method.
S_LTR -> means the LTR script was used; S_SSR means the Short Sequence
Repeat script was used (refer to the manuscript for details).
T-> means this is the classification obtained using the Target Site
Duplication (TSD) method.

All three methods (H, S, and T) can give all three levels of
classification (C, SC, and SF).