README for REPCLASS
-------------------
**User configuration guide**
-------------------
(last update 06/09/09)

version 1.0.0

Feschotte Lab - University of Texas at Arlington

Contact information for questions and bug reports:
Cedric Feschotte
Email: cedric@uta.edu
Office: 817-272-2426
Lab: 817-272-5574
Fax: 817-272-2855

Thanks:
Umeshkumar Keswani
Nirmal Ranganathan
Marcel L. Guibotsy Mboulas
Assiatou Barrie
David Levine

-------------------

**User Configuration Guide**

This guide explains where to find the variables that must be configured 
for each run, and where to find and how to read the outputs.

Please read the entire 'User Configuration Guide' before proceeding, and 
make sure you understand each step required for a successful configuration.

1. Copy the 'sample.conf' file from the 'REPCLASS/conf' folder to the 
   folder where the target genome is located and rename it appropriately, 
   e.g. with the command:

   > cp /home/software/REPCLASS/conf/sample.conf  /home/your_name/genomes/repclass_configure.conf

2. Open this 'repclass_configure.conf' and make the following 
   modifications:

	2.1. Configure the '$JOB_NAME' entry to an appropriate job name 
             (e.g. name of the species and assembly, or name of organism):
		e.g. $JOB_NAME = "hg18";

	2.2. Configure the '$DATA' entry to the complete location of the 
             folder where you want the repclass output to be located.
		e.g. $DATA = "/home/your_name/REPCLASSoutputs/human_genome";

	2.3. Configure the '$GENOME_LOCATION' entry to the complete 
             location of the folder containing the target genome fasta 
             file (generally where all the genomes are located).
		e.g. $GENOME_LOCATION = "/home/your_name/genomes";

	2.4. '$GENOME_FILE' refers to the actual name of the input target 
             genome fasta file.
		e.g. $GENOME_FILE = "hg18_all";

	2.5. '$TE_SEQUENCE' entry refers to the complete location of the 
             file which contains the consensus sequences that you want to 
             classify.
		e.g. $TE_SEQUENCE = "/home/your_name/ConsensusLibrary/human_repeats_cons_seq.txt";

3. Save the modifications and close the 'repclass_configure.conf' file.
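
Before moving on, it can help to check that the five entries from step 2 
are present and well formed. The following is a minimal sketch, not part 
of REPCLASS: the function names 'conf_value' and 'check_repclass_conf' 
are hypothetical, and the sketch assumes each entry sits on its own line 
in the form shown in the examples above ($NAME = "value";).

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with REPCLASS).

# Print the quoted value of one entry, e.g. conf_value my.conf JOB_NAME
# Assumes the form:  $NAME = "value";   (one entry per line)
conf_value() {
    sed -n 's/^[[:space:]]*\$'"$2"'[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*;.*/\1/p' "$1" | head -n 1
}

# Report entries that are missing or malformed (e.g. no trailing ';'),
# then check that the genome folder and the consensus file exist.
check_repclass_conf() {
    conf=$1
    status=0
    for name in JOB_NAME DATA GENOME_LOCATION GENOME_FILE TE_SEQUENCE; do
        value=$(conf_value "$conf" "$name")
        if [ -z "$value" ]; then
            echo "missing or malformed entry: \$$name"
            status=1
        fi
    done
    genome_dir=$(conf_value "$conf" GENOME_LOCATION)
    if [ ! -d "$genome_dir" ]; then
        echo "genome folder not found: $genome_dir"
        status=1
    fi
    te_file=$(conf_value "$conf" TE_SEQUENCE)
    if [ ! -f "$te_file" ]; then
        echo "consensus sequence file not found: $te_file"
        status=1
    fi
    return $status
}

# Example (path from this guide):
# check_repclass_conf /home/your_name/genomes/repclass_configure.conf
```

The function prints one line per problem and returns a non-zero status 
if anything looks wrong, so it can be used as a guard before step 5.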

4. Go to the 'REPCLASS/bin' folder and open the file 'runrc'. Replace 
   'ArgumentPath' with the complete location of the file 
   'repclass_configure.conf', e.g.:
   > perl /home/software/REPCLASS/bin/rc.pl ArgumentPath;
   would become:
   > perl /home/software/REPCLASS/bin/rc.pl /home/your_name/genomes/repclass_configure.conf;

   The 'ArgumentPath' (the 'repclass_configure.conf' file) is passed into 
   the subsequent perl scripts to ensure the proper call of package 
   variables.
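
   The replacement in step 4 can also be done from the command line. 
   The sketch below is one possible way, not part of REPCLASS: the 
   function name 'set_runrc_conf' is hypothetical, and it assumes that 
   'runrc' still contains the literal placeholder 'ArgumentPath' and 
   that the configuration path contains no '|' or '&' characters.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with REPCLASS).
# Replaces the 'ArgumentPath' placeholder in runrc with the complete
# path of a configuration file. The original is kept as runrc.bak.
set_runrc_conf() {
    # $1 = path to the runrc file, $2 = complete path to the .conf file
    cp "$1" "$1.bak" &&
    sed "s|ArgumentPath|$2|" "$1.bak" > "$1"
}

# Example (paths from this guide):
# set_runrc_conf /home/software/REPCLASS/bin/runrc \
#     /home/your_name/genomes/repclass_configure.conf
```

   Writing the result back into the existing 'runrc' keeps its execute 
   permission, and the '.bak' copy still holds the placeholder, so it 
   can be restored for a later run.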

5. RUNNING REPCLASS

Finally, execute the 'runrc' script as follows (from the same folder as 
in step 4):
	./runrc

Note:
	'runrc' calls another program, 'rc.pl'.

	'rc.pl' in turn calls 5 programs:

		5.1. 'xdformat' (from the 'wublast' package) - creates a 
blastable database from the genome file, so that the consensus sequence 
library can be blasted against this database.

		5.2. 'fasta_index.pl' - generates an index of the target 
genome fasta file for faster execution.

		5.3. 'repclass.pl' - opens directories for each output, 
separates the individual consensus sequences and passes each of them to 
'repmain.pl' as input through successive calls to that script.

		5.4. 'repmain.pl' - classifies each sequence using the 3 
classification methods, namely Homology (HOM), Structural (STR) and 
Target Site Duplication (TSD). Of these methods, Homology is the most 
reliable, followed by Structural.

		5.5. 'result.pl' - collates the results obtained for every 
sequence with the three classification methods mentioned above. For each 
sequence in the output there are three classifications for the user to 
see. If the classifications conflict, we recommend using the reliability 
ranking above (Homology > Structural > TSD) to resolve the conflict.

	Because the three REPCLASS modules can be run independently, you 
will be prompted to choose among seven options when you start the 
program:
		
		1 Homology (HOM)

		2 Target Site Duplication (TSD)

		3 Structural (STR)

		4 HOM and TSD

		5 HOM and STR

		6 TSD and STR

		7 HOM, TSD and STR

	When prompted, type the number corresponding to your choice, then 
press 'Enter' to confirm.
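
	The seven options are simply every non-empty combination of the 
three methods. As a small illustration, the sketch below maps a list of 
method names to the option number listed above. The function name 
'repclass_option' is hypothetical and is not part of REPCLASS.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with REPCLASS).
# Prints the menu number for a combination of HOM, TSD and STR,
# following the option list given in this guide.
repclass_option() {
    hom=0; tsd=0; str=0
    for method in "$@"; do
        case $method in
            HOM) hom=1 ;;
            TSD) tsd=1 ;;
            STR) str=1 ;;
            *) echo "unknown method: $method" >&2; return 1 ;;
        esac
    done
    case "$hom$tsd$str" in
        100) echo 1 ;;   # HOM
        010) echo 2 ;;   # TSD
        001) echo 3 ;;   # STR
        110) echo 4 ;;   # HOM and TSD
        101) echo 5 ;;   # HOM and STR
        011) echo 6 ;;   # TSD and STR
        111) echo 7 ;;   # HOM, TSD and STR
        *) echo "select at least one method" >&2; return 1 ;;
    esac
}

# Example: repclass_option HOM STR
# Assumption (untested): if the prompt reads its answer from standard
# input, the number could be piped in, e.g.
# repclass_option HOM STR | ./runrc
```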

'runrc', 'rc.pl', 'fasta_index.pl', 'repclass.pl', 'repmain.pl', and 
'result.pl' are all located in the 'REPCLASS/bin' folder of the REPCLASS 
package.

6. HOW TO ACCESS AND READ THE OUTPUTS

	6.1. ACCESSING OUTPUTS

REPCLASS delivers the results of the analyses in the '$DATA' folder (see 
2.2. above). In '$DATA', you will find a folder named after the target 
genome file, which in turn contains a file ("final.out.txt") and a folder 
named after the "$JOB_NAME" entry (see 2.1. above). For a summary of the 
final results, open the "final.out.txt" file. The folder named after the 
"$JOB_NAME" entry contains 10 folders (whose names are self-explanatory) 
and the "repclass.log" file, which records information about events such 
as failures during the process.
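
To locate the summary and the log without relying on the exact folder 
names, the '$DATA' folder can be searched for the two file names given 
above. This is a minimal sketch; the function name 
'find_repclass_outputs' is hypothetical and is not part of REPCLASS.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with REPCLASS).
# Lists every final.out.txt and repclass.log file found under the
# folder that was set as $DATA in the configuration file.
find_repclass_outputs() {
    # $1 = the $DATA folder
    find "$1" -type f \( -name final.out.txt -o -name repclass.log \) -print
}

# Example (path from 2.2. in this guide):
# find_repclass_outputs /home/your_name/REPCLASSoutputs/human_genome
```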

	6.2. READING OUTPUTS

The following key will help you interpret the output in the "final.out.txt" file:

 C()  means this is the classification under CLASS (i for 
      Retrotransposons and ii for Transposons).
 SC() means this is the classification under SUB-CLASS (DNA 
      under Transposon; LTR & Non-LTR under Retrotransposon).
 SF() means this is the classification under SUPER-FAMILY. Refer to 
      the manuscript for details.
 H->  means this is the classification obtained using the Homology method.
 S->  means this is the classification obtained using the Structural 
      method. 
 S_LTR -> using the LTR script, S_SSR using the Short Sequence Repeat 
      script (refer to the manuscript for details).
 T->  means this is the classification obtained using the Target Site 
      Duplication (TSD) method.

All three methods (H, S, and T) can give all three levels of 
classification (C, SC, and SF).
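
As a quick first look at a result file, the method markers from the key 
above can be counted. This is a rough sketch under a stated assumption: 
it supposes that the markers 'H->', 'S->' and 'T->' appear literally in 
"final.out.txt" as written in this guide; the exact line layout of the 
file is not described here. The function name 'count_by_method' is 
hypothetical and is not part of REPCLASS.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with REPCLASS).
# For each method marker, prints the number of lines in the given
# file that contain it. Assumes the markers appear literally.
count_by_method() {
    # $1 = path to final.out.txt
    for marker in 'H->' 'S->' 'T->'; do
        # -F: fixed string, -c: count matching lines
        # '|| true' because grep returns non-zero when the count is 0
        n=$(grep -cF -e "$marker" "$1" || true)
        echo "$marker $n"
    done
}

# Example:
# count_by_method /path/to/final.out.txt
```

Note that the fixed string 'S->' does not match the 'S_LTR' and 'S_SSR' 
markers, which would need their own patterns.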