Questions and Answers
Supplementary Material to our manuscript: "CateGOrizer: A Web-Based Program to Batch Analyze Gene Ontology Classification Categories."
- What is GO?
- What is GO Terms Classifications Counter?
- What is DAG?
- What is this tool designed for?
- How does it work for me?
- How are the results presented to me?
- What's the file format for upload?
- Can I use my own classification method?
- How are ancestral classifications counted?
- What is Transitive Closure?
- What are the "single occurrences" and "all occurrences"?
- Why is the sum of my counts more than the total GO terms in the input?
- What are "counted terms" and "odd terms"?
- How to properly interpret my results? Any hints to make better use of the tool?
- Is there a way to improve my counting results?
- How may I cite the use of this GO Classification Counter tool?
What is GO?
GO is short for Gene Ontology, which uses a controlled vocabulary to describe
gene and gene product attributes in any organism, and to describe the basic
term categories and relationships.
What is GO Terms Classifications Counter?
The GO Terms Classifications Counter consists of several Perl CGI programs
coupled with a MySQL DBMS that stores the GO terms DAG data (pre-computed term
associations).
What is DAG?
DAG is short for Directed Acyclic Graph, a directed graph with no directed
cycles; that is, for any vertex v, there is no nonempty directed path that
starts and ends on v. DAGs can be considered a generalization of trees in which
certain subtrees can be shared by different parts of the tree. In a tree with
many identical subtrees, this sharing can drastically reduce the space required
to store the structure. The figure on the right shows a simple DAG structure (a)
and how the distances between terms may be calculated (b).
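As a rough illustration of the structure described above, the sketch below stores a small DAG as a mapping from each term to its direct parents and measures the distance from a term to each of its ancestors. The term names are made up, and taking the distance to be the number of edges on the shortest path is an assumption; the figure may use a different convention.

```python
from collections import deque

# Hypothetical DAG: each term maps to the list of its direct parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # C has two parents, so this structure is not a tree
    "D": ["C"],
}

def distances_to_ancestors(term, parents):
    """Breadth-first search upward; returns {ancestor: shortest edge count}."""
    dist = {}
    queue = deque([(term, 0)])
    while queue:
        node, d = queue.popleft()
        for parent in parents[node]:
            if parent not in dist:  # first visit is the shortest path in BFS
                dist[parent] = d + 1
                queue.append((parent, d + 1))
    return dist

print(distances_to_ancestors("D", PARENTS))
```

Here "root" is reached from "D" through either "A" or "B", but it is stored only once, which is the sharing that distinguishes a DAG from a tree.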
What is this tool designed for?
Originally this tool was created to meet the needs of large-scale EST analysis,
where people wish to get a rough idea of how their annotated ESTs are distributed
across different gene categories. The tool is also useful for understanding
microarray data, where a large number of up/down-regulated genes may be quickly
categorized to show which gene classes change the most. It can be used in
combination with, for example, pathway analysis.
How does it work for me?
The program takes GO terms (e.g. GO:0007067) as input, either as a list or as
an unformatted plain text file, and lets the user choose one of the available
classification methods, such as GO_slim, GOA, EGAD, MGI_GO_slim, ... or GO-ROOT.
It then performs the count and returns the results on the web (if the count
takes too long, it emails the user a link to the results). The output includes
the counts and percentages in a sorted table, plus a pie chart.
You can also upload your own "slim" classification to use. If you have a new
"slim" classification that you don't want to upload every time, and that you
wish to make useful to others as well, let us know and we will be happy to add
it.
How are the results presented to me?
Shown on the right is a snapshot of the main part of a CateGOrizer results
page, where the counted GO terms are shown in tabulated form along with a pie
chart. The raw GO IDs are also appended below as a reference, separated into
"counted" and "odd" groups. Each GO term and GO class in the graph is
hyperlinked to the AmiGO browser at the GO web site for details.
What's the file format for upload?
The GO term files for upload must be in plain text (ASCII) format, preferably
laid out as a list, one entry per line, as shown in the following example.
GO:0008629
GO:0051241
GO:0007157
GO:0016849
GO:0004383
... ...
However, this format is not strictly enforced, because the program is "smart
enough" to fish the GO terms out even if the file is not formatted as a
single-column list.
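One way such lenient parsing could work is to scan the text for anything shaped like a GO ID ("GO:" followed by seven digits) and ignore everything else. This is only a sketch of the idea; the function name is made up here and the tool's actual parsing rules are not documented in this FAQ.

```python
import re

# A GO ID is written as "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")

def extract_go_ids(text):
    """Return every GO ID found in the text, in order, duplicates kept."""
    return GO_ID_PATTERN.findall(text)

# Input that is not a clean one-per-line list still yields the IDs.
sample = "GO:0008629\nGO:0051241, GO:0007157; see also GO:0008629"
print(extract_go_ids(sample))
```

Duplicates are kept on purpose, since the frequency of each term in the raw data matters for the counts.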
Can I use my own classification method?
Sure you can. Your classification file must be a plain text (ASCII) list, one
entry per line, with each line beginning with the GO ID, separated from its
definition by a tab or space, as in:
GO:0009058 biosynthesis
GO:0008152 metabolism
GO:0009056 catabolism
GO:0019725 cell homeostasis
GO:0005515 protein binding
GO:0005840 ribosome
... ... ... ...
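A file in this layout can be read by splitting each line at the first run of whitespace. The sketch below assumes that lines not starting with "GO:" should simply be skipped; the function name is hypothetical and this is not the tool's own parser.

```python
def parse_classification(text):
    """Map each GO ID to its definition; blank or malformed lines are skipped."""
    classes = {}
    for line in text.splitlines():
        # Split on the first run of whitespace (tab or spaces) only,
        # so multi-word definitions stay intact.
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].startswith("GO:"):
            classes[parts[0]] = parts[1].strip()
    return classes

sample = "GO:0009058\tbiosynthesis\nGO:0019725 cell homeostasis\n\nnot a term line\n"
classes = parse_classification(sample)
print(classes)
```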
How are ancestral classifications counted?
The transitive closure is built into the GO database schema: the paths from
every node to all of its ancestors are pre-computed, which is equivalent to
computing the paths from every node to all of its descendants (see reference
"ii"). Counting a term into a parental term is then simply a matter of checking
whether a path leads from the term to that ancestor.
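With the ancestor paths pre-computed, counting reduces to a membership test. The sketch below models the pre-computed data as a set of (ancestor, descendant) pairs; the pair set, the choice to treat a term as its own ancestor, and the function name are all assumptions for illustration, not the actual MySQL schema or queries.

```python
from collections import Counter

# Illustrative pre-computed closure as (ancestor, descendant) pairs.
# Assumption: every term is also paired with itself, so that a term
# that is itself a classification class is counted into that class.
CLOSURE = {
    ("GO:0008152", "GO:0008152"),
    ("GO:0008152", "GO:0009058"),
    ("GO:0008152", "GO:0009056"),
    ("GO:0005515", "GO:0005515"),
}

def count_into_classes(input_terms, class_terms, closure):
    """Count each input term into every class that is one of its ancestors."""
    counts = Counter()
    for term in input_terms:
        for cls in class_terms:
            if (cls, term) in closure:  # a path exists from term up to cls
                counts[cls] += 1
    return counts

counts = count_into_classes(
    ["GO:0009058", "GO:0009056", "GO:0005515", "GO:0009058"],
    ["GO:0008152", "GO:0005515"],
    CLOSURE,
)
print(counts)
```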
What is Transitive Closure?
The relationships between GO terms do not form a simple tree (although parts of
the structure may look like hierarchies); they are organized in structures
called directed acyclic graphs (DAGs). For example, a GO term may have more
than one child and/or more than one parent (see reference "i"). The transitive
property of numbers states that if A = B and B = C, then A = C. Derby, for
example, applies this property to query predicates to add additional predicates
to a query in order to give the optimizer more information. This process is
called transitive closure. Applied to the GO DAG, the same idea means that if
term A is an ancestor of term B, and B is an ancestor of term C, then A is also
an ancestor of C; the transitive closure is the full set of such
ancestor-descendant pairs.
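For a graph, the idea can be sketched as follows: starting from direct parent links only, collect every ancestor of every term. The term names are made up, and this recursive approach is just one simple way to compute a closure; it is not how the GO database builds its pre-computed paths.

```python
def transitive_closure(parents):
    """Return {term: set of all ancestors} for a DAG given as term -> direct parents."""
    closure = {}

    def ancestors(term):
        # Each term is resolved once and then reused (memoization).
        if term not in closure:
            found = set()
            for parent in parents.get(term, []):
                found.add(parent)
                found |= ancestors(parent)  # a parent's ancestors are ancestors too
            closure[term] = found
        return closure[term]

    for term in parents:
        ancestors(term)
    return closure

# Hypothetical DAG: C -> B -> A, and D has two direct parents.
PARENTS = {"A": [], "B": ["A"], "C": ["B"], "D": ["B", "A"]}
closure = transitive_closure(PARENTS)
print(closure)
```

"A" is never listed as a direct parent of "C", yet it appears among the ancestors of "C" because it is reachable through "B".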
What are the "single occurrences count" and "all occurrences count"?
It is common for multiple paths to exist between an ancestral/parental term and
a child term. In the single occurrences count, a child term is counted only
once even when multiple paths are found between it and an ancestral term.
The single occurrences count is often used to get an idea of how the child
terms are "classified", and it can often effectively avoid inflated total
counts. In contrast, when each and every path between an ancestral term and a
child term is counted, we call it the all occurrences count. This can be useful
for getting an idea of how complex the relationships between the terms are.
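The difference between the two counts can be sketched by counting the distinct upward paths between a child term and an ancestor. The DAG and names below are hypothetical, and this reflects one reading of the definitions above rather than the tool's own code.

```python
def count_paths(term, ancestor, parents):
    """Number of distinct upward paths from term to ancestor in a DAG."""
    if term == ancestor:
        return 1
    return sum(count_paths(p, ancestor, parents) for p in parents.get(term, []))

# Hypothetical DAG in which C reaches root through both A and B.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"]}

all_occurrences = count_paths("C", "root", PARENTS)   # every path is counted
single_occurrence = 1 if all_occurrences > 0 else 0   # counted once if any path exists

print(all_occurrences, single_occurrence)
```

For very large graphs the recursion would repeat work on shared subpaths; caching the per-term results would avoid that.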
Why is the sum of my counts more than the total GO terms in the input?
Ideally, one might wish that a term belongs to, and is counted into, an ancestor
term only once. In reality this is often not the case. As discussed above, a GO
term can trace back to more than one parent, so the trace from a child term to
an ancestor term may form more than one independent path, and the child term is
then counted multiple times into that ancestor term. Where several such
relationships occur consecutively along a path, a "fork" may happen at any of
the terms along it, so the overall sum of counts may appear much "inflated". If
the "inflation" is larger than acceptable, one may want to scrutinize the GO
terms in the classification used, see whether their representative coverage
overlaps, and adjust the counting results as needed to improve the
representation (see the answers to the following questions).
What are "counted terms" and "odd terms"?
Along with the counted results, the Counter lists the "Counted Terms" and "Odd
Terms". The "Counted Terms" are those found to belong to at least one of the
classes in a given classification method. The "Odd Terms" are those not found
to belong to any class in the given classification method. Both groups of terms
are listed in sorted order, with the frequency of each term in the raw GO term
data set given in parentheses after the term. This information may be helpful
for evaluating how well the GO annotations are represented under the given
classification method. For example, if a classification method leaves many "odd
terms", the method may not be the best fit for the raw data set. Users are
encouraged to try out different classification methods, or to modify the
classification, to improve the representation of the raw data set.
The classification that produces the fewest "odd terms" may be the best for
your data set.
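The split into the two groups can be sketched as below, assuming the ancestors of each input term are already known. The ancestor table, the made-up ID GO:9999999, and the function name are assumptions for illustration only.

```python
from collections import Counter

def split_counted_and_odd(input_terms, class_terms, ancestors):
    """Return (counted, odd) dicts mapping each term to its frequency in the input."""
    freq = Counter(input_terms)
    counted, odd = {}, {}
    for term in sorted(freq):
        # A term is "counted" if it is a class itself or descends from one.
        belongs = term in class_terms or bool(ancestors.get(term, set()) & class_terms)
        (counted if belongs else odd)[term] = freq[term]
    return counted, odd

# Illustrative ancestor table; GO:9999999 is a made-up ID with no known ancestors.
ANCESTORS = {"GO:0009058": {"GO:0008152"}, "GO:9999999": set()}
counted, odd = split_counted_and_odd(
    ["GO:0009058", "GO:9999999", "GO:0009058"], {"GO:0008152"}, ANCESTORS
)
print("Counted:", ", ".join(f"{t} ({n})" for t, n in counted.items()))
print("Odd:", ", ".join(f"{t} ({n})" for t, n in odd.items()))
```

Comparing the size of the "odd" group across several classification methods gives a quick way to see which method covers a data set best.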
How to properly interpret my results? Any hints to make better use of the tool?
As the count of GO terms into each ancestral term is independent of the others,
the counted results may be used selectively, and the percentages recalculated,
as long as the selected terms cover the spectrum of your scope well. This helps
to avoid redundancy and may make the most sense for a particular data set. For
example, in the "GO_Slim" classification, the three main GO classes,
"molecular_function", "cellular_component", and "biological_process", may
themselves already constitute full coverage of the GO spectrum, and therefore
overlap with the other (child) terms in "GO_Slim" in terms of representative
coverage. The counts of these three terms may therefore be taken out, the sum of
the remaining terms taken as "100%", and the percentage for each term
recalculated.
Is there a way to improve my counting results?
Yes. The following graph gives an example: a comparison of two analyses of one
raw GO ID data set, showing the importance of properly selecting a
classification method. In (a), an "immune class" was chosen as the
classification method, while in (b) "GO slim" was the classification method.
The table below gives a closer look at the classification counting results on
3,216 GO terms, showing the importance of carefully selecting a suitable "GO
Slim". In (a), the three root GO classes, "biological_process",
"molecular_function", and "cellular_component", were included in the
classification method, which caused much inflation of the total counts because
of the overlaps between parental classification terms. In (b), when these three
classes were excluded from the classification method, the counts appear to be
"more normal".
(a) is shown in the left four columns and (b) in the right four columns.
| GO Class ID | Definitions | Counts | Fractions | | GO Class ID | Definitions | Counts | Fractions |
| GO:0008150 | biological_process | 971 | 12.74% | | GO:0008150 | biological_process | 971 | 39.17% |
| GO:0003674 | molecular_function | 780 | 10.23% | | GO:0003674 | molecular_function | 780 | 31.46% |
| GO:0005575 | cellular_component | 728 | 9.55% | | GO:0005575 | cellular_component | 728 | 29.37% |
| GO:0005623 | cell | 589 | 7.73% | | Total | | 2479 | 100.00% |
| GO:0005488 | binding | 354 | 4.65% | | GO Class ID | Definitions | Counts | Fractions |
| GO:0005622 | intracellular | 290 | 3.81% | | GO:0005623 | cell | 589 | 7.73% |
| GO:0008152 | metabolism | 250 | 3.28% | | GO:0005488 | binding | 354 | 4.65% |
| GO:0003824 | catalytic activity | 207 | 2.72% | | GO:0005622 | intracellular | 290 | 3.81% |
| GO:0005515 | protein binding | 192 | 2.52% | | GO:0008152 | metabolism | 250 | 3.28% |
| GO:0007154 | cell communication | 179 | 2.35% | | GO:0003824 | catalytic activity | 207 | 2.72% |
| GO:0005737 | cytoplasm | 161 | 2.11% | | GO:0005515 | protein binding | 192 | 2.52% |
| GO:0005886 | plasma membrane | 143 | 1.88% | | GO:0007154 | cell communication | 179 | 2.35% |
| GO:0005576 | extracellular region | 137 | 1.80% | | GO:0005737 | cytoplasm | 161 | 2.11% |
| GO:0007165 | signal transduction | 135 | 1.77% | | GO:0005886 | plasma membrane | 143 | 1.88% |
| GO:0004871 | signal transducer activity | 110 | 1.44% | | GO:0005576 | extracellular region | 137 | 1.80% |
| GO:0007275 | development | 96 | 1.26% | | GO:0007165 | signal transduction | 135 | 1.77% |
| GO:0016043 | cell organization and biogenesis | 94 | 1.23% | | GO:0004871 | signal transducer activity | 110 | 1.44% |
| GO:0016787 | hydrolase activity | 84 | 1.10% | | GO:0007275 | development | 96 | 1.26% |
| GO:0019538 | protein metabolism | 79 | 1.04% | | GO:0016043 | cell organization and biogenesis | 94 | 1.23% |
| GO:0005615 | extracellular space | 75 | 0.98% | | GO:0016787 | hydrolase activity | 84 | 1.10% |
| ... | ... | ... | ... | | ... | ... | ... | ... |
| Total | | 6621 | 100.00% | | Total | | 4142 | 100.00% |
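Recalculating percentages after dropping selected classes, as discussed for the three root classes, comes down to dividing each remaining count by the new total. The sketch below uses only the first five counts from the table, so its percentages are not expected to match the table's fractions; the function name is made up for illustration.

```python
def percentages(counts, exclude=()):
    """Percentage of each remaining class after dropping the excluded ones."""
    kept = {name: n for name, n in counts.items() if name not in exclude}
    total = sum(kept.values())
    return {name: round(100.0 * n / total, 2) for name, n in kept.items()}

# Only the first five rows of the table, for brevity.
COUNTS = {
    "biological_process": 971,
    "molecular_function": 780,
    "cellular_component": 728,
    "cell": 589,
    "binding": 354,
}
ROOTS = {"biological_process", "molecular_function", "cellular_component"}

with_roots = percentages(COUNTS)
without_roots = percentages(COUNTS, exclude=ROOTS)
print(with_roots)
print(without_roots)
```

The raw counts do not change when classes are excluded; only the total used as "100%" does.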
How may I cite the use of this GO Classification Counter tool?
A manuscript is in preparation. In the meantime you are welcome to cite the use
of this tool as:
If you have any questions that are not answered in this FAQ, please feel free to
send them to us. It may be a good question that we should have addressed but
missed, and it might be a useful addition as a new entry in this FAQ!
REFERENCES
i. The Gene Ontology Consortium (2006). An Introduction to the Gene Ontology.
URL: http://www.geneontology.org/GO.doc.shtml,
Last modified: February 17, 2006.
ii. Chris Mungall (2004). GO Database Schema.
URL: http://www.godatabase.org/dev/sql/doc/godb-sql-doc.html,
Last modified: April 5, 2004.
iii. Hu, Zhi-Liang, Jie Bao and James M. Reecy (2007).
A Gene Ontology (GO) Terms Classifications Counter.
Plant & Animal Genome XV Conference, San Diego, CA,
January 13-17, 2007.
iv. Hu, Zhi-Liang, Jie Bao and James M. Reecy (2008).
CateGOrizer: A Web-Based Program to Batch Analyze Gene Ontology Classification
Categories. Online Journal of Bioinformatics. 9 (2):108-112.