Blast Match of Query Sequences to Pig Genome Build 10.2 Genes by Their Map Locations
-- A Pilot Analysis of 3 Pig Array Platforms for Genome Enabled Annotations
Results:
README:
PURPOSE:
Determine how the elements of the two Pig Affy arrays and the Pig Oligo array
align with annotated genes on pig genome assembly version 10.2. The goal is to
enrich the gene annotation information for the Pig Affy and Pig Oligo arrays,
and to facilitate annotation comparisons between the platforms.
DATA:
- Affy 2005: Affymetrix "new" pig array designed in 2005. 23,935 consensus
sequences were used.
- Affy 2010: Affymetrix "new" pig array designed in 2010. 1,142,126 sequences
were combined from 3 data sets:
o SNOWBALL_array_seqs.fa
o SNOWBALL_consensus.fa -> unique_coding_seqs_for_array_v4.fa
o SNOWBALL_miRNA.fa -> miRNAs_array_seqs_v4.fa
(Data forwarded from Chris Tuggle, cktuggle@iastate.edu)
- Oligo 2006: The "70-mer" oligonucleotide array designed in 2006 by a
consortium group. Download: http://www.pigoligoarray.org/
Of the 18,224 downloaded sequences, many are longer than the nominal
70-mer.
APPROACH:
The Pig Affy array and Pig Oligo array elements were mapped by BLAST (NCBI
blastall, v.2.2.22) against SSC Build 10.2. The mapping results were then
used to query the NCBI Gene DB for genes that overlap the map coordinates.
The BLAST criteria were set with these empirical thresholds:
* Cut-off e-value: < 1e-3
* Identity > 80%
* Minimum alignment length > 30bp
Although the blastall options were set to report only the top hits, sub-optimal
hits often "leaked" into the output. A Perl/MySQL procedure was developed to
enforce these criteria.
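The filtering step can be sketched as follows. This is a minimal Python
illustration, not the Perl/MySQL procedure used in the analysis. It assumes the
BLAST results are in the 12-column tabular format (blastall -m 8), and it
assumes that "top hit" means the passing hit with the highest bit score for each
query; neither detail is stated above. The function name best_hits is
hypothetical.

```python
import csv

# Thresholds taken from the list above.
MAX_EVALUE = 1e-3      # e-value must be below this
MIN_IDENTITY = 80.0    # percent identity must be above this
MIN_ALIGN_LEN = 30     # alignment length (bp) must be above this


def best_hits(lines):
    """Return {query_id: best passing hit} from 12-column BLAST tabular lines.

    Assumption: the best hit is the one with the highest bit score.
    """
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = {
            "query": row[0],
            "subject": row[1],
            "identity": float(row[2]),
            "length": int(row[3]),
            "s_start": int(row[8]),
            "s_end": int(row[9]),
            "evalue": float(row[10]),
            "bitscore": float(row[11]),
        }
        passes = (
            hit["evalue"] < MAX_EVALUE
            and hit["identity"] > MIN_IDENTITY
            and hit["length"] > MIN_ALIGN_LEN
        )
        if not passes:
            continue
        current = best.get(hit["query"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["query"]] = hit
    return best
```

Passing an open file handle, or any iterable of tab-delimited lines, returns at
most one hit per query sequence, so sub-optimal hits are dropped even if
blastall reported them.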
To analyze the overlaps between the BLAST map coordinates and the genes on pig
build 10.2, a local database was set up with Ensembl pig gene mapping data. The
overlap analysis was performed by querying the local database for the locations
of known Ensembl genes and comparing them against the BLAST match coordinates
of the query sequences to identify reliable overlaps.
Suppose a stretch of genome sequence is represented with "--", on which a gene
region is bordered by "#" on each side and labeled "i-->j", and a matched query
sequence is represented with "==" and labeled "a-->b". The possible overlap
situations, which also give the matched strand information (+/-, +/+, -/-),
are illustrated below:
                i             j
    5' ---------#-------------#-------- 3'
(1) 5' -----a===#=====b-------#-------- 3'  ..a..i..b..j..
(2) 5' ---------#----a========#==b----- 3'  ..i..a..j..b..
(3) 5' ---------#--a======b---#-------- 3'  ..i..a..b..j..
(4) 5' ------a==#=============#==b----- 3'  ..a..i..j..b..
(5) 5' ------b==#=============#==a----- 3'  ..b..i..j..a..
(6) 5' ------b==#========a----#-------- 3'  ..b..i..a..j..
(7) 5' ---------#---b=========#==a----- 3'  ..i..b..j..a..
(8) 5' ---------#---b=====a---#-------- 3'  ..i..b..a..j..
                i             j
A minimum overlap of 50 bp, chosen arbitrarily, was also required to increase
confidence in the overlap matches.
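The eight situations above reduce to one interval comparison once the match
coordinates are ordered. The sketch below is a Python illustration, not the
Perl script used in the analysis. It assumes 1-based inclusive coordinates with
i <= j, and it assumes that an overlap of exactly 50 bp is accepted; neither
detail is stated above. The function names are hypothetical.

```python
MIN_OVERLAP = 50  # bp, the minimum overlap described above


def overlap(i, j, a, b):
    """Overlap between gene i..j and match a..b.

    Assumes 1-based inclusive coordinates and i <= j.
    Returns (length, start, end, orientation), or None when there is no overlap.
    """
    lo, hi = min(a, b), max(a, b)
    # Cases (1)-(4) have a before b; cases (5)-(8) have b before a.
    orientation = "forward" if a <= b else "reverse"
    start, end = max(i, lo), min(j, hi)
    length = end - start + 1
    if length <= 0:
        return None
    return length, start, end, orientation


def accepted(i, j, a, b, min_overlap=MIN_OVERLAP):
    """True when the match overlaps the gene by at least min_overlap bp."""
    result = overlap(i, j, a, b)
    return result is not None and result[0] >= min_overlap
```

For example, overlap(1000, 2000, 900, 1100) corresponds to case (1), and
overlap(1000, 2000, 2100, 900) corresponds to case (5).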
RESULTS:
The mapping data are made available on the NAGRP shared data repository:
http://www.animalgenome.org/repository/pig/Genome_build_10.2_mappings/
Two Perl scripts were developed to (1) calculate and identify acceptable
overlaps, and (2) format data to facilitate further evaluations. The output
consists of 2 files that contain identical results in different formats.
Note that some query sequences may match more than one gene. The query
sequence names are suffixed with a serial number following a double colon (::).
Users can sort the list by query sequence name to bring entries for the same
query sequence together; a serial number higher than 1 indicates multiple
gene matches.
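As an illustration of the "::" serial suffix, the Python sketch below splits
the suffix from each name and lists the query sequences that have more than one
gene match. The helper names and the example names in the usage note are
hypothetical and are not taken from the output files.

```python
from collections import defaultdict


def split_serial(name):
    """Split 'query_name::N' into (query_name, N); N is None without a suffix."""
    base, sep, serial = name.rpartition("::")
    if not sep or not serial.isdigit():
        return name, None
    return base, int(serial)


def group_by_query(names):
    """Map each query sequence name to the sorted serial numbers seen for it."""
    groups = defaultdict(list)
    for name in names:
        base, serial = split_serial(name)
        if serial is not None:
            groups[base].append(serial)
        else:
            groups[base]  # register the name even without a serial number
    return {base: sorted(serials) for base, serials in groups.items()}


def multi_gene_queries(names):
    """Return the query sequence names that carry more than one serial number."""
    return sorted(
        base for base, serials in group_by_query(names).items() if len(serials) > 1
    )
```

With the made-up names "probe_A::1", "probe_B::1" and "probe_A::2", only
probe_A is reported as having multiple gene matches.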
Output format 1, file name: "genes.match.byPlatform1.xlsx"
----------------------------
This format is useful to compare matched elements among the 3 platforms on the
same gene. The columns are:
- ENS_stable_ids: Ensembl stable ID
- External Names: Ensembl "external" names; often, but not always, HGNC
symbols.
- Genome_locations: Represented as "chromosome:start-end".
- Matched data sets: complete or partial overlaps with the gene's genome location.
[Syntax: Seq_ID(overlap(strand):start-end(e-value))]
o Affy 2005 matches
o Affy 2010 matches
o pig_Oligo matches
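An entry in the matched data set columns could be unpacked with a regular
expression, as in the Python sketch below. It reads the syntax line above
literally and makes several assumptions that should be checked against the
actual files: "overlap" is taken to be an integer length in bp, "strand" is
taken to be a pair such as +/+ or +/-, and the e-value is taken to be a number
Python can parse. The example entry in the usage note is made up.

```python
import re

# Assumed shape: Seq_ID(overlap(strand):start-end(e-value))
ENTRY_RE = re.compile(
    r"^(?P<seq_id>.+?)"
    r"\((?P<overlap>\d+)"
    r"\((?P<strand>[+-]/[+-])\)"
    r":(?P<start>\d+)-(?P<end>\d+)"
    r"\((?P<evalue>[^()]+)\)\)$"
)


def parse_entry(text):
    """Parse one matched-element entry; return None if it does not fit."""
    m = ENTRY_RE.match(text.strip())
    if m is None:
        return None
    return {
        "seq_id": m["seq_id"],
        "overlap": int(m["overlap"]),
        "strand": m["strand"],
        "start": int(m["start"]),
        "end": int(m["end"]),
        "evalue": float(m["evalue"]),
    }
```

For instance, parse_entry("probe_A(101(+/+):1000-1100(1e-50))") yields a
dictionary with the sequence ID, overlap length, strand pair, coordinates and
e-value as separate fields.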
Output format 2, file name: "genes.match.byPlatform2.xlsx"
----------------------------
This format lets users sort the data in whatever way suits their own analysis.
The columns are:
- ENS_stable_ids: Ensembl stable ID
- External Names: Ensembl "external" names; often, but not always, HGNC
symbols.
- Genome_locations: Represented as "chromosome:start-end".
- Gene length (bp):
- Data sets: Contains one of (1) Affy 2005 matches; (2) Affy 2010 matches;
(3) Pig_Oligo matches.
- Sequence names: The IDs of one of the array element sequences.
- Strand (quer/subj):
- Overlap length (bp):
- Overlap coordinates (bp):
- e-values: Blast e-values.
WORKING DIRECTORY:
~/projects/Tuggle_oligolocatn/
KNOWN BUGS: (none at this time)
FUTURE WORK:
The methods and scripts used in this analysis may be further developed into a
publicly available tool that lets users upload custom sequence data sets, or
known coordinates, and returns a list of matched genes plus related information,
possibly linking to GBrowse for visualization.
Please send feedback and comments to cktuggle@iastate.edu or zhu@iastate.edu
--
Zhiliang Hu
May 29 09:42:18 CDT 2012
© 2003-2025:
USA · USDA · NRPSP8 · Program to Accelerate Animal Genomics Applications.