Annotation Cross Products
Each GO annotation refers to a single term from the ontology. This restricts annotators in what they can say - there must be a pre-existing term in the ontology, or one must be requested. It would be far less restrictive if the annotator could combine multiple terms in a single annotation. These terms could even come from other OBO ontologies.
This page describes the proposed new column 16 in the GAF, which allows additional terms to be specified to extend the meaning of an annotation. If and when an annotator chooses to do this, they are effectively creating an "on-the-fly" cross product term. We say "on-the-fly" because the combinatorial term is not added to the ontology (although it could be at a later stage, if the ontology editors choose to do so).
This proposal owes a lot to the MGI structured notes internal field in the MGD database.
External Ontologies required
Only ontologies committed to the principles of the OBO Foundry (http://obofoundry.org) should be included.
- CHEBI : Chemical Entities
- CL : Cell ontology
- taxon-centric anatomy ontologies (AOs):
- ZFA (zebrafish)
- MA (adult mouse)
- FMA (human)
- XAO (xenopus)
- FBbt (fly)
- WBbt (worm)
- (add others here)
Use Cases
Function and Process co-annotation
Molecular functions are always executed in the context of a biological process (in a cellular location).
At the moment, we "weakly" co-annotate function and process, but there is no way of knowing which functions go with which processes. A gene G may be annotated to F1, F2, F3 and P1, P2, P3. It may be the case that F1 and P3 never go together, or that when G executes F2 it is always in the context of P2.
Annotators need a way of saying, on a per-annotation basis, that an F is executed in the context of a P.
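To make the ambiguity concrete, here is a minimal Python sketch using the placeholder labels from the paragraph above (F1..F3 and P1..P3; the variable names are illustrative only and are not part of any GO tooling). With only per-gene lists of terms, every function-process pairing is equally plausible:

```python
from itertools import product

# Placeholder labels taken from the paragraph above; not real GO IDs.
functions = ["F1", "F2", "F3"]
processes = ["P1", "P2", "P3"]

# With weak co-annotation, nothing rules out any of these pairings.
candidate_pairs = list(product(functions, processes))
print(len(candidate_pairs))  # 9

# A per-annotation link would let an annotator assert specific pairs instead.
asserted_pairs = {("F2", "P2")}
unresolved = [pair for pair in candidate_pairs if pair not in asserted_pairs]
print(len(unresolved))  # 8
```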
Example:
F1: protein serine/threonine/tyrosine kinase activity
P1: peptidyl-tyrosine phosphorylation
P2: positive regulation of protein kinase activity
P3: positive regulation of small GTPase mediated signal transduction
F1: sequence-specific DNA binding
lots of Ps, one of which is 'negative regulation of transcription from RNA polymerase II promoter'.
Note that this is complementary to the project to link the process and function ontologies. The inter-ontology links could be used as aids to annotators.
Immune System regulation terms: BP and CL
(see email thread from Evelyn on GO list, "another immune related query GO and CL")
Chicken IL-10 is secreted from, e.g., macrophages, but causes 'negative regulation of interferon gamma biosynthesis' in chicken splenocytes.
TODO: need help refining this use case. It was decided that splenocytes were not a great example.
Subcellular localisation (CC) within a specific type of cell (CL)
- Toll-like receptor 4 (TLR4) (O00206) is located intracellularly in the perinuclear region (GO:0048471) only in immature DC, PMID:15027902
- TLR4 is located on the cell surface (GO:0005887) in monocytes, PMID:15027902
Evelyn's comments: So protein localisation is cell type specific and for immune gene GO annotation I think we need to be able to capture this.
Another example:
We want to annotate "localised to nucleus of spermatocyte"
Note that we have some pre-coordinated CC-CL terms in GO. See XP:cellular_component_xp_cell
Example from MGI: TODO
Regulation of expression and specific gene products
The GO will never pre-coordinate terms such as:
- regulation of oskar mRNA translation
- regulation of oskar mRNA transcription
But it is perfectly appropriate to post-compose such terms at annotation time.
The GO term used would be "regulation of transcription/translation"
The properties column would contain an ID for oskar or oskar mRNA. Technically it should be
- a gene ID for "regulation of gene expression"
- a transcript ID for "regulation of transcription"
- a protein ID for "regulation of translation"
However, obtaining the right kind of ID can often be difficult. We can relax this requirement so long as we are clear on what it means to provide a gene ID for "regulation of translation".
Binding
https://sourceforge.net/tracker2/?func=detail&aid=2175326&group_id=36855&atid=440764
Response to drug (BP + CHEBI)
See tracker item discussion.
We don't want to make children of "response to drug", as this would violate the true path (TP) rule ("drugs" do not always play the role of drugs). Instead, we would like to indicate at annotation time when the response to chemical X is a drug response.
Linking together annotations
Question from Emily:
"In addition, would this column be the place to specifically link together annotations from the different GO vocabularies? For instance if you had say, four annotations for protein X which had been annotated to: 'regulation of transcription', 'protein stabilization', 'cytoplasm' and 'nucleus' - a curator might want to link the 'regulation of transcription' process annotation specifically with the cellular component 'nucleus'."
The two options here are:
- group the annotations together somehow, perhaps using a grouping ID.
- redundantly indicate the localisation information
In the second scenario, there would be a normal-looking annotation to 'nucleus' with nothing in the properties column. There would also be an annotation to 'regulation of transcription', and this would have 'nucleus' in the properties column.
Proposed Solutions
Column 16 of the GAF is used to refine the term used to describe the aspect of the gene product. We will call this the term extension (EXT) column here.
There are two possible solutions on the table. One is simpler to produce and use, but loses information that could potentially be useful. The other is richer and more extensible, but is more difficult to produce and parse. Originally the richer solution was proposed. The simpler solution was added to this page later.
Simple Solution
The simple solution is to allow a list of IDs in the EXT column, separated by ; or |. These IDs would be drawn from OBO Foundry ontologies.
Examples
TLR Example (simple scheme)
- Toll-like receptor 4 (TLR4) (O00206) is located intracellularly in the perinuclear region (GO:0048471) only in immature DC, PMID:15027902
- TLR4 is located on the cell surface (GO:0005887) in monocytes, PMID:15027902
pseudo-GAF: (the parts after the ! would not be in the actual file)
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| TLR4 O00206 | perinuclear region (GO:0048471) | PMID:15027902 | CL:new ! immature dendritic cell |
| TLR4 O00206 | cell surface (GO:0005887) | PMID:15027902 | CL:0000576 ! monocyte |
Notes: we lose the "only" quantifier. This scheme has no way of distinguishing a CC localization that happens only in certain cell types from one that sometimes happens in those cell types. However, we may rarely know the "only" cases.
Note also that there is an implicit part_of relation between the CC and the CL.
Anatomy example (simple scheme)
A process that happens in an anatomical location:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| CREB | GO:0006094 ! gluconeogenesis | PMID:nnnn | MA:0000358 ! liver |
Response to drug (simple scheme)
There are different options for "response to cocaine as drug".
Option 1:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| moody (FBgn0025631) | GO:0042220 ! response to cocaine | PMID:nnnn | CHEBI:23888 ! drug |
This one is problematic, as we would rather use CHEBI as the authoritative source on chemical structures rather than on roles.
Option 1b:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| moody (FBgn0025631) | GO:0042493 ! response to drug | PMID:nnnn | CHEBI:27958 ! cocaine |
This is not ideal either as the annotation minus the EXT is not very informative.
Option 2. Here we use 2 GO terms
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| moody (FBgn0025631) | GO:0042220 ! response to cocaine | PMID:nnnn | GO:0042493 ! response to drug |
Not ideal as software is forced to use the EXT column to get response to drug.
Option 2b - we redundantly use both
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| moody (FBgn0025631) | GO:0042220 ! response to cocaine | PMID:nnnn | GO:0042493 ! response to drug |
| moody (FBgn0025631) | GO:0042493 ! response to drug | PMID:nnnn | GO:0042220 ! response to cocaine |
Multiple localizations example (simple)
What if the publication describes separate observations - perhaps one for bipolar neuron and one for Purkinje cell?
We can separate these using |. This is equivalent to splitting the annotation over two lines. For example:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | CL:0000121 PIPE CL:0000103 ! bipolar neuron & Purkinje cell |
(PIPE stands for a literal | character, which could not be written directly inside a wiki table.)
Localizations to a CC within a CL within a gross anatomical location (simple scheme)
We use "," to separate multiple extensions for the same instance:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| MGI:1919277 Slc39a4 | GO:0016324 ! apical plasma membrane | PMID:nnnn | MA:0000337,CL:0000584 PIPE EMAP:6894,CL:0000223 ! enterocyte of small intestine AND endodermal cell of TS22\,extraembryonic component |
Note that we can't use a single separator, as we may end up with interpretations like "endodermal cell of small intestine".
This could also be written as:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| MGI:1919277 Slc39a4 | GO:0016324 ! apical plasma membrane | PMID:nnnn | MA:0000337,CL:0000584 ! enterocyte of small intestine |
| MGI:1919277 Slc39a4 | GO:0016324 ! apical plasma membrane | PMID:nnnn | EMAP:6894,CL:0000223 ! endodermal cell of TS22\,extraembryonic component |
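The two separators used in the simple scheme could be handled along the following lines. This is only a sketch: the function name `parse_simple_ext` is hypothetical, it assumes the real column contains a literal `|` where the tables above write PIPE, and it treats anything after `!` as a readability comment, as in the pseudo-GAF examples.

```python
from typing import List


def parse_simple_ext(cell: str) -> List[List[str]]:
    """Parse a simple-scheme EXT value into groups of IDs.

    '|' separates independent observations; ',' separates IDs that
    jointly describe one observation. Text after '!' is dropped.
    """
    value = cell.split("!", 1)[0].strip()
    if not value:
        return []  # the column is optional, so an empty cell is valid
    groups = []
    for group in value.split("|"):
        ids = [part.strip() for part in group.split(",") if part.strip()]
        if ids:
            groups.append(ids)
    return groups


print(parse_simple_ext("MA:0000337,CL:0000584|EMAP:6894,CL:0000223"))
# [['MA:0000337', 'CL:0000584'], ['EMAP:6894', 'CL:0000223']]
```

Each inner list corresponds to one annotation line in the two-row form of the table above.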
Advantages of the simple scheme
Advantages of the simple EXT scheme:
- Simple for GAF providers
- Simple for GAF consumers
Disadvantages:
- Lack of expressivity
The main disadvantage of this scheme is that it does not express the relationship between the MF/BP/CC term referenced in col 5 and the entity in the EXT column. This is less of a problem for humans, who are skilled at guessing the context. It is a problem for computers.
Expressive Solution
The more expressive solution allows us to specify the relationship between col 5 and col 16.
The basic syntax we would use would be:
RELATION '(' OBO-ID ')'
The relation would be drawn from the OBO relation ontology.
Note that this syntax is fully generalisable to all kinds of cross products.
For example:
- part_of(CL:0000017)
This is the CL ID for "spermatocyte". If the GO term in the annotation was for "nucleus", then the overall meaning of the annotation would be "a nucleus that is in a spermatocyte".
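As an illustration of how a consumer might read one such expression, here is a small sketch using a regular expression. The names `Property` and `parse_property` are hypothetical, and the pattern is an assumption about what counts as a relation name and an OBO ID rather than part of the proposal.

```python
import re
from typing import NamedTuple


class Property(NamedTuple):
    relation: str
    term_id: str


# A relation name, then an OBO-style ID (ID-space ':' local ID) in parentheses.
_PROPERTY_RE = re.compile(
    r"^\s*([A-Za-z_][\w:]*)\s*\(\s*([A-Za-z_][\w.]*:[^\s()]+)\s*\)\s*$"
)


def parse_property(text: str) -> Property:
    """Split a single RELATION(OBO-ID) expression into its two parts."""
    match = _PROPERTY_RE.match(text)
    if match is None:
        raise ValueError(f"not a RELATION(OBO-ID) expression: {text!r}")
    return Property(match.group(1), match.group(2))


print(parse_property("part_of(CL:0000017)"))
# Property(relation='part_of', term_id='CL:0000017')
```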
BP-MF Example (with relations)
pseudo-GAF (the parts after the ! would not be in the actual file, we are just including them here to make the examples readable)
Here is gene 1234, which executes GTPase activity as part of an intracellular signaling cascade:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| Gene1234 | GO:0003924 ! GTPase activity | PMID:nnnn | part_of(GO:0007242) ! intracellular signaling cascade |
| Gene1234 | GO:0007242 ! intracellular signaling cascade | PMID:nnnn | (empty) |
CC-CL Example (with relations)
pseudo-GAF (the parts after the ! would not be in the actual file, we are just including them here to make the examples readable)
Here is an imaginary gene localized to the mitochondrial membrane in a spermatocyte:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | part_of(CL:0000017) ! spermatocyte |
BP x anatomy example (with relations)
Example of a gene product executing its function in a particular location. Here we use the unfolds_in relation:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| CREB | GO:0006094 ! gluconeogenesis | PMID:nnnn | unfolds_in(MA:0000358) ! liver |
TLR example (with relations)
- Toll-like receptor 4 (TLR4) (O00206) is located intracellularly in the perinuclear region (GO:0048471) only in immature DC, PMID:15027902
- TLR4 is located on the cell surface (GO:0005887) in monocytes, PMID:15027902
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| TLR4 O00206 | perinuclear region (GO:0048471) | PMID:15027902 | part_of(CL:new) ! immature dendritic cell |
| TLR4 O00206 | cell surface (GO:0005887) | PMID:15027902 | part_of(CL:0000576) ! monocyte |
Multiple localizations example (with relations)
What if the publication describes separate observations - perhaps one for bipolar neuron and one for Purkinje cell?
We can separate these using |. This is equivalent to splitting the annotation over two lines. For example:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | part_of(CL:0000121) PIPE part_of(CL:0000103) ! bipolar neuron & Purkinje cell |
The "|" separator indicates that this is a separate localization of a different instance of this gene product.
(PIPE stands for a literal | character, which could not be written directly inside a wiki table.)
What if we want to annotate two separate observations of the same subcellular localization - one from an astrocyte of the hippocampus, the other from a B cell in the lymph?
We use the "," to indicate an additional extension for the same observation. So the above would be:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | part_of(CL:0000127),part_of(MA:0000953) PIPE part_of(CL:0000236),part_of(MA:0002520) ! one from an astrocyte of the hippocampus, the other from a B cell in the lymph |
This would be equivalent to two annotations:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | part_of(CL:0000127),part_of(MA:0000953) ! astrocyte of the hippocampus |
| Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | part_of(CL:0000236),part_of(MA:0002520) ! a B cell in the lymph |
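The equivalence between one "|"-separated line and several single lines can be sketched as a simple expansion step. The dictionary keys (`gene`, `term`, `ref`, `ext`) and the function name `split_annotation` are made up for this illustration, and a real file would contain a literal `|` rather than PIPE.

```python
from typing import Dict, List


def split_annotation(row: Dict[str, str]) -> List[Dict[str, str]]:
    """Return one annotation row per '|'-separated group in the 'ext' field."""
    groups = [g.strip() for g in row.get("ext", "").split("|") if g.strip()]
    if not groups:
        return [dict(row)]  # nothing to expand; keep the row as it is
    return [{**row, "ext": group} for group in groups]


combined = {
    "gene": "Gene1234",
    "term": "GO:0031966",
    "ref": "PMID:nnnn",
    "ext": "part_of(CL:0000127),part_of(MA:0000953)"
           "|part_of(CL:0000236),part_of(MA:0002520)",
}

for expanded in split_annotation(combined):
    print(expanded["term"], expanded["ext"])
```

Because "," binds more tightly than "|", each expanded row keeps its cell type and anatomy terms together.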
Here is another, real-life example from MGI:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| MGI:1919277 Slc39a4 | GO:0016324 ! apical plasma membrane | PMID:nnnn | part_of(MA:0000337),part_of(CL:0000584) ! enterocyte of small intestine |
| MGI:1919277 Slc39a4 | GO:0016324 ! apical plasma membrane | PMID:nnnn | part_of(EMAP:6894),part_of(CL:0000223) ! endodermal cell of TS22\,extraembryonic component |
Response to drug (with relations)
There are different options for "response to cocaine as drug".
Option 1:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| moody (FBgn0025631) | GO:0042220 ! response to cocaine | PMID:nnnn | response_to(CHEBI:23888) ! drug |
Here we need a new relation, "response_to".
Another solution:
| Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
|---|---|---|---|
| moody (FBgn0025631) | GO:0042220 ! response to cocaine | PMID:nnnn | is_a(GO:0042493) ! response to drug |
Implementation Plan
- test annotation files will be made available to Berkeley (contributors: MGI, GOA, Dicty...?) with col16 populated
- Berkeley will populate a test database (Seth)
- toy version of AmiGO with CL IDs queryable
- change schema of production db
- officially add spec for col16
- annotation contributors start adding columns
- CL populated and queryable in public AmiGO
- Extend scheme to other OBO ontologies
The toy version of AmiGO should be ready by the GO meeting.
Database Implementation
See SWUG:Database
FAQ
Will this replace existing combinatorial GO terms like "B cell differentiation"?
No! It is important to keep terms like this pre-coordinated in the GO.
When do I request a new term and when do I use the properties column?
Request a new term if it seems like a sensible new term to have in GO. A combinatorial term in GO is fine if it corresponds to a commonly used scientific term and the combination is not completely arbitrary or accidental.
Appendix
Grammar for col 16
This is specified as a BNF grammar. This is necessary to keep the field extensible enough for future use. Note that the column is optional, so there is no requirement for people to parse it. It is an 'added bonus' column.
PropertiesSet := Properties | Properties "|" PropertiesSet
Properties := Property | Property ',' Properties
Property := Relation '(' Term ')'
Term := ID
Relation := Relation-Abbrev | ID
ID := ID-Space ':' Local-ID
ID-Space := XML-NMToken
Local-ID := chars
Relation-Abbrev := chars
Relations can be abbreviated; e.g. part_of can be used in place of OBO_REL:part_of.
This can be extended to allow for nested expressions:
Term := ID | ID '^' Properties
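One possible way to read this grammar, including the nested form, is a small recursive-descent parser. The sketch below is illustrative only: the class and function names are hypothetical, the tokenizer assumes that names never contain whitespace or any of the characters `( ) , | ^`, and the result is returned as plain nested tuples and lists.

```python
import re
from typing import List, Optional

PUNCT = set("(),|^")
TOKEN_RE = re.compile(r"\s*([(),|^]|[^\s(),|^]+)")


def tokenize(text: str) -> List[str]:
    text, tokens, pos = text.strip(), [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class ExtParser:
    def __init__(self, text: str) -> None:
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected: Optional[str] = None) -> str:
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected or 'a name'}, got {tok!r}")
        self.pos += 1
        return tok

    def name(self) -> str:
        tok = self.take()
        if tok in PUNCT:
            raise ValueError(f"expected a name, got {tok!r}")
        return tok

    def properties_set(self) -> list:
        # PropertiesSet := Properties | Properties "|" PropertiesSet
        groups = [self.properties()]
        while self.peek() == "|":
            self.take("|")
            groups.append(self.properties())
        if self.peek() is not None:
            raise ValueError(f"unexpected trailing token {self.peek()!r}")
        return groups

    def properties(self) -> list:
        # Properties := Property | Property ',' Properties
        props = [self.property()]
        while self.peek() == ",":
            self.take(",")
            props.append(self.property())
        return props

    def property(self) -> tuple:
        # Property := Relation '(' Term ')'
        relation = self.name()
        self.take("(")
        term = self.term()
        self.take(")")
        return (relation, term)

    def term(self) -> tuple:
        # Term := ID | ID '^' Properties
        term_id = self.name()
        if ":" not in term_id:
            raise ValueError(f"term is not of the form ID-Space:Local-ID: {term_id!r}")
        nested = []
        if self.peek() == "^":
            self.take("^")
            nested = self.properties()
        return (term_id, nested)


def parse_ext(text: str) -> list:
    """Parse a non-empty col 16 value into groups of (relation, (id, nested))."""
    return ExtParser(text).properties_set()


print(parse_ext("part_of(CL:0000127^part_of(MA:0000953))|part_of(CL:0000236)"))
```

Since the column is optional, a caller would be expected to skip empty values before calling `parse_ext`; this sketch raises `ValueError` on empty or malformed input.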
