Zhiliang's Workbench:
Information / progress track

Pig array re-annotations using new pig genome annotation data


Objectives: Use current NCBI RefSeq annotation information to enrich/update
            the annotations of the pig oligo sequences.

Work log:

  1. Feb 12, 2014:  Pig RefSeq data sources

     Source   I: ftp://ftp.ncbi.nlm.nih.gov/genomes/Sus_scrofa/RNA/
                   - Contains 51,361 sequences.
     Source  II: http://www.ncbi.nlm.nih.gov/nuccore
                a. Limit to 'sus scrofa'[orgn] and filter by RefSeq[property]
                b. Manually filter out those that do not have annotation info:
        CLEAN |   - 4,562 "unplaced genomic scaffold" sequences
              |   - 5,652 "genomic scaffold" sequences
              |   - 6,195 "whole genome shotgun" sequences
                c. This ends up with 51,488 annotated RefSeq sequences.
      ANALYZE |    Data from 'Source I' and 'Source II' differ in size by
              |    (51,488-51,209) => 279 sequences. They share 51,150 sequences,
              |    with an additional 397 sequences found in only one of the two.
                d. Extract the sequence headers and combine the 51,547 annotated
                   sequences into a pig RefSeq target data set.
     Source III: ftp://ftp.ncbi.nlm.nih.gov/refseq/S_scrofa/ (official)
                   - Contains 51,362 sequences.
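
     Note: the 'Source I' vs 'Source II' comparison above amounts to set
     operations on sequence identifiers. A minimal Python sketch, assuming each
     source has been reduced to a list of FASTA header lines and that the first
     token after '>' identifies the sequence. The helper names and toy headers
     are hypothetical and are not the procedure actually used here.

```python
def header_ids(headers):
    """Return the set of identifiers from FASTA header lines (first token after '>')."""
    return {h[1:].split()[0] for h in headers if h.startswith(">")}


def compare_sources(ids_a, ids_b):
    """Summarize how two identifier sets overlap."""
    return {
        "shared": len(ids_a & ids_b),
        "only_a": len(ids_a - ids_b),
        "only_b": len(ids_b - ids_a),
        "combined": len(ids_a | ids_b),
    }


# Toy headers for illustration only (not real records).
source_1 = [">seq1 example A", ">seq2 example B", ">seq3 example C"]
source_2 = [">seq2 example B", ">seq3 example C", ">seq4 example D", ">seq5 example E"]

summary = compare_sources(header_ids(source_1), header_ids(source_2))
print(summary)

# Consistency check on the figures reported above:
# shared + found-in-only-one-source should equal the combined set size.
assert 51_150 + 397 == 51_547
```

     The last line only checks that the reported shared (51,150) and
     single-source (397) counts add up to the combined 51,547.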

  2. Mar 14, 2014: BWA match, against the pig RefSeq set, of the consensus sequences
                   from the "2006 pig transcripts consortium"
                   (downloaded from http://www.pigoligoarray.org; now stored at the
                   AnimalGenome.ORG data repository -> "pig_Oligo_consensus_seq.fa.gz")

    o Briefly, of 18,224 "pig consortium (2006) oligo" sequences,
                  20,062 BWA matches to pig RefSeq were found on
                  16,784 unique sequences (download).
    o Break down: 14,164 -> single match
                   2,108 -> 2 matches
                     404 -> 3 matches
                      79 -> 4 matches
                      24 -> 5 matches
                       3 -> 6 matches
                       1 -> 7 matches
                       1 -> 9 matches
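
    Note: a breakdown like the one above can be tallied from alignment output.
    A minimal Python sketch, assuming the BWA results are in SAM format and that
    every mapped record counts as one match. How multiple matches were actually
    counted for this table is not recorded here, so treat that as an assumption.
    Function names and toy records are hypothetical.

```python
from collections import Counter


def hits_per_query(sam_lines):
    """Count mapped alignment records per query name in SAM-formatted lines."""
    hits = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & 0x4:
            continue  # FLAG bit 0x4 set: query is unmapped
        hits[qname] += 1
    return hits


def breakdown(hits):
    """Map 'number of matches' -> 'number of query sequences with that many'."""
    return dict(sorted(Counter(hits.values()).items()))


# Toy SAM records (hypothetical names; only QNAME and FLAG are used).
toy_sam = [
    "@SQ\tSN:ref1\tLN:1000",
    "oligoA\t0\tref1\t10\t60\t70M\t*\t0\t0\t*\t*",
    "oligoB\t0\tref1\t200\t0\t70M\t*\t0\t0\t*\t*",
    "oligoB\t256\tref2\t50\t0\t70M\t*\t0\t0\t*\t*",
    "oligoC\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

hits = hits_per_query(toy_sam)
table = breakdown(hits)
print(sum(hits.values()), "matches on", len(hits), "unique sequences:", table)
```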

  3. Mar 19, 2014: BWA match, against the pig RefSeq set, of the Swine Protein-Annotated
                   Oligonucleotide Microarray data (Illumina 70-mer oligo synthesis; a.k.a.
                   "GPL7435"; out of the "2006 pig consensus transcripts consortium" work)

    o Briefly, of 20,400 "GPL7435" sequences,
                  17,245 BWA matches to pig RefSeq were found on
                  17,207 unique sequences (download).
    o Break down: 17,169 -> single match
                      38 -> 2 matches
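
    Note: both breakdown tables can be cross-checked against their totals. The
    per-row sequence counts should sum to the number of unique sequences, and
    (matches x sequences) summed over rows should give the total matches. A
    short Python sketch of that arithmetic (the function name is hypothetical):

```python
def check_breakdown(table, total_matches, unique_sequences):
    """Return True if a {matches_per_sequence: n_sequences} table agrees with the totals."""
    n_sequences = sum(table.values())
    n_matches = sum(k * n for k, n in table.items())
    return n_sequences == unique_sequences and n_matches == total_matches


# Figures copied from the two breakdown tables in this log.
consortium_2006 = {1: 14_164, 2: 2_108, 3: 404, 4: 79, 5: 24, 6: 3, 7: 1, 9: 1}
gpl7435 = {1: 17_169, 2: 38}

consortium_ok = check_breakdown(consortium_2006, 20_062, 16_784)
gpl7435_ok = check_breakdown(gpl7435, 17_245, 17_207)
print(consortium_ok, gpl7435_ok)
```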
       
Zhiliang Hu