Annotation Cross Products

From GO Public

Each GO annotation refers to a single term from the ontology. This restricts annotators in what they can say - there must be a pre-existing term in the ontology, or one must be requested. It would be far less restrictive if the annotator could combine multiple terms in a single annotation. These terms could even come from other OBO ontologies.

This page describes the proposed new column 16 in the GAF, which allows additional terms to be specified to extend the meaning of an annotation. If and when an annotator chooses to do this, they are effectively creating an "on-the-fly" cross product term. We say "on-the-fly" because the combinatorial term is not added to the ontology (although it could be at a later stage, if the ontology editors choose to do so).

This proposal owes a lot to the MGI structured notes internal field in the MGD database.

External Ontologies required

Only ontologies committed to the principles of the OBO Foundry (http://obofoundry.org) should be included.

  • CHEBI : Chemical Entities
  • CL : Cell ontology
  • taxon-centric anatomy ontologies (AOs):
    • ZFA (zebrafish)
    • MA (adult mouse)
    • FMA (human)
    • XAO (xenopus)
    • FBbt (fly)
    • WBbt (worm)
    • (add others here)
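
The list above could be turned into a simple check on the ID space of each extension term. This is a minimal sketch, not part of the proposal: the names `ALLOWED_ID_SPACES`, `id_space` and `is_allowed` are hypothetical, and the allow-list contains only the ontologies named on this page so far.

```python
# Hypothetical allow-list, copied from the ontologies listed on this page.
# The "(add others here)" entry means the real list would be longer.
ALLOWED_ID_SPACES = {"CHEBI", "CL", "ZFA", "MA", "FMA", "XAO", "FBbt", "WBbt"}


def id_space(obo_id: str) -> str:
    """Return the part of an OBO-style ID that precedes the first colon."""
    prefix, sep, local = obo_id.partition(":")
    if not sep or not prefix or not local:
        raise ValueError(f"not an OBO-style ID: {obo_id!r}")
    return prefix


def is_allowed(obo_id: str) -> bool:
    """True if the ID comes from one of the ontologies on the allow-list."""
    return id_space(obo_id) in ALLOWED_ID_SPACES
```

Under this assumed list, `is_allowed("CL:0000576")` is true, while an ID such as `EMAP:6894` would be rejected until EMAP is added.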

Use Cases

Function and Process co-annotation

Molecular functions are always executed in the context of a biological process (in a cellular location).

At the moment, we "weakly" co-annotate function and process, but there is no way of knowing which functions go with which processes. A gene G may be annotated to F1, F2, F3 and P1, P2, P3. It may be the case that F1 and P3 never go together, or that when G executes F2 it is always in the context of P2.

Annotators need a way of saying, on a per-annotation basis, that an F is executed in the context of a P.

Example:

F1: protein serine/threonine/tyrosine kinase activity

P1: peptidyl-tyrosine phosphorylation

P2: positive regulation of protein kinase activity

P3: positive regulation of small GTPase mediated signal transduction


F1: sequence-specific DNA binding

lots of Ps, one of which is 'negative regulation of transcription from RNA polymerase II promoter'.



Note that this is complementary to the project to link the process and function ontologies. The inter-ontology links could be used as aids to annotators.

Immune System regulation terms: BP and CL

(see email thread from Evelyn on GO list, "another immune related query GO and CL")

Chicken IL-10 is secreted from, e.g., macrophages, but causes 'negative regulation of interferon gamma biosynthesis' in chicken splenocytes.

TODO: need help refining this use case. It was decided that splenocytes were not a great example.

Subcellular localisation (CC) within a specific type of cell (CL)

  • Toll-like receptor 4 (TLR4) (O00206) is located intracellularly in the perinuclear region (GO:0048471) only in immature DC, PMID:15027902
  • TLR4 is located on the cell surface (GO:0005887) in monocytes, PMID:15027902

Evelyn's comments: So protein localisation is cell type specific and for immune gene GO annotation I think we need to be able to capture this.

Another example:

We want to annotate "localised to nucleus of spermatocyte"

Note that we have some pre-coordinated CC-CL terms in GO. See XP:cellular_component_xp_cell

Example from MGI: TODO

Regulation of expression and specific gene products

The GO will never pre-coordinate terms such as:

  • regulation of oskar mRNA translation
  • regulation of oskar mRNA transcription

But it is perfectly appropriate to post-compose such a term at annotation time.

The GO term used would be "regulation of transcription/translation"

The properties column would contain an ID for oskar or oskar mRNA. Technically it should be

  • a gene ID for "regulation of gene expression"
  • a transcript ID for "regulation of transcription"
  • a protein ID for "regulation of translation"

However, this can often be difficult. We can relax this so long as we are clear on what it means to provide a gene ID for "regulation of translation".

Binding

https://sourceforge.net/tracker2/?func=detail&aid=2175326&group_id=36855&atid=440764

Response to drug (BP + CHEBI)

See tracker item discussion.

We don't want to make children of "response to drug", as this would violate the TP (true path) rule: "drugs" do not always play the role of drugs. Instead, we would like to indicate at annotation time when the response to chemical X is a drug response.

Linking together annotations

Question from Emily:

"In addition, would this column be the place to specifically link together annotations from the different GO vocabularies? For instance if you had say, four annotations for protein X which had been annotated to: 'regulation of transcription', 'protein stabilization', 'cytoplasm' and 'nucleus' - a curator might want to link the 'regulation of transcription' process annotation specifically with the cellular component 'nucleus'."

The two options here are:

  1. group the annotations together somehow, perhaps using a grouping ID.
  2. redundantly indicate the localisation information

In the second scenario, there would be a normal-looking annotation to 'nucleus' with nothing in the properties column. There would also be an annotation to 'regulation of transcription', and this would have 'nucleus' in the properties column.

Proposed Solutions

Column 16 of the GAF is used to refine the term used to describe the aspect of the gene product. We will call this the term extension (EXT) column here.

There are two possible solutions on the table. One is simpler to produce and use, but loses information that could potentially be useful. The other is richer and more extensible, but is more difficult to produce and parse. Originally the richer solution was proposed. The simpler solution was added to this page later.

Simple Solution

The simple solution is to allow a list of IDs in the EXT column, separated by ; or |. These IDs would be drawn from OBO Foundry ontologies.

Examples

TLR Example (simple scheme)

  • Toll-like receptor 4 (TLR4) (O00206) is located intracellularly in the perinuclear region (GO:0048471) only in immature DC, PMID:15027902
  • TLR4 is located on the cell surface (GO:0005887) in monocytes, PMID:15027902

pseudo-GAF: (the parts after the ! would not be in the actual file)

 Gene (col 2/3)  Term (col 5)                     Ref (col 6)    Ext (col 16)
 TLR4 O00206     perinuclear region (GO:0048471)  PMID:15027902  CL:new ! immature dendritic cell
 TLR4 O00206     cell surface (GO:0005887)        PMID:15027902  CL:0000576 ! monocyte

Notes: we lose the "only" quantifier. In this scheme we have no way of distinguishing CC localizations that happen only in certain cell types from those that sometimes happen in those cell types. However, we may rarely know the "only" cases.

Note also there is an implicit part_of relation between the CC and the CL.
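
The implicit relation can be spelled out mechanically when it is needed. The sketch below is only an illustration under stated assumptions: the function name `make_relation_explicit` is hypothetical, the aspect code "C" is the GAF code for cellular component, and the only rule encoded is the one noted above (a CL term extending a CC annotation means part_of). Any other combination is rejected rather than guessed.

```python
def make_relation_explicit(aspect: str, ext_id: str) -> str:
    """Rewrite a simple-scheme extension ID as RELATION(ID).

    Only the CC + CL case is handled, because that is the only implicit
    relation stated in the text. Everything else raises an error.
    """
    if aspect == "C" and ext_id.startswith("CL:"):
        return f"part_of({ext_id})"
    raise ValueError(
        f"no implicit relation known for aspect {aspect!r} and ID {ext_id!r}"
    )
```

For the monocyte row above, `make_relation_explicit("C", "CL:0000576")` gives `part_of(CL:0000576)`.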

Anatomy example (simple scheme)

A process that happens in an anatomical location:

 Gene (col 2/3)  Term (col 5)                  Ref (col 6)  Ext (col 16)
 CREB            GO:0006094 ! gluconeogenesis  PMID:nnnn    MA:0000358 ! liver

Response to drug (simple scheme)

There are different options for "response to cocaine as drug".

Option 1:

 Gene (col 2/3)       Term (col 5)                      Ref (col 6)  Ext (col 16)
 moody (FBgn0025631)  GO:0042220 ! response to cocaine  PMID:nnnn    CHEBI:23888 ! drug

This one is problematic, as we would rather use CHEBI as the authoritative source on chemical structures rather than roles.

Option 1b:

 Gene (col 2/3)       Term (col 5)                   Ref (col 6)  Ext (col 16)
 moody (FBgn0025631)  GO:0042493 ! response to drug  PMID:nnnn    CHEBI:27958 ! cocaine

This is not ideal either as the annotation minus the EXT is not very informative.

Option 2: here we use two GO terms.

 Gene (col 2/3)       Term (col 5)                      Ref (col 6)  Ext (col 16)
 moody (FBgn0025631)  GO:0042220 ! response to cocaine  PMID:nnnn    GO:0042493 ! response to drug

Not ideal as software is forced to use the EXT column to get response to drug.

Option 2b: we redundantly use both.

 Gene (col 2/3)       Term (col 5)                      Ref (col 6)  Ext (col 16)
 moody (FBgn0025631)  GO:0042220 ! response to cocaine  PMID:nnnn    GO:0042493 ! response to drug
 moody (FBgn0025631)  GO:0042493 ! response to drug     PMID:nnnn    GO:0042220 ! response to cocaine

Multiple localizations example (simple)

What if the publication describes separate observations - perhaps one for bipolar neuron and one for Purkinje cell?

We can separate these using |. This is equivalent to splitting the annotation over two lines. For example:


 Gene (col 2/3)  Term (col 5)                         Ref (col 6)  Ext (col 16)
 Gene1234        GO:0031966 ! mitochondrial membrane  PMID:nnnn    CL:0000121 PIPE CL:0000103 ! bipolar neuron & Purkinje cell

(I can't figure out how to include a pipe in a wiki table so I just wrote PIPE)

Localizations to a CC within a CL within a gross anatomical location (simple scheme)

We use "," to separate multiple extensions for the same instance:

 Gene (col 2/3)       Term (col 5)                         Ref (col 6)  Ext (col 16)
 MGI:1919277 Slc39a4  GO:0016324 ! apical plasma membrane  PMID:nnnn    MA:0000337,CL:0000584 PIPE EMAP:6894,CL:0000223 ! enterocyte of small intestine AND endodermal cell of TS22\,extraembryonic component

Note we can't use a single separator, as we may end up with interpretations like "endodermal cell of small intestine".

This could also be written as:

 Gene (col 2/3)       Term (col 5)                         Ref (col 6)  Ext (col 16)
 MGI:1919277 Slc39a4  GO:0016324 ! apical plasma membrane  PMID:nnnn    MA:0000337,CL:0000584 ! enterocyte of small intestine
 MGI:1919277 Slc39a4  GO:0016324 ! apical plasma membrane  PMID:nnnn    EMAP:6894,CL:0000223 ! endodermal cell of TS22\,extraembryonic component
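
For GAF consumers, reading the simple scheme could be as small as two nested splits. The following is a rough sketch rather than a reference parser: `parse_simple_ext` is a hypothetical name, and it assumes the real file uses a literal "|" between separate observations and "," between IDs that belong to the same observation, with the "! label" comments absent.

```python
from typing import List


def parse_simple_ext(ext: str) -> List[List[str]]:
    """Split a simple-scheme EXT value into observations.

    Each observation is a list of IDs. '|' separates observations and
    ',' separates the IDs that refine the same observation.
    """
    if not ext.strip():
        return []  # an empty EXT column carries no extensions
    return [
        [term.strip() for term in group.split(",") if term.strip()]
        for group in ext.split("|")
    ]


observations = parse_simple_ext("MA:0000337,CL:0000584|EMAP:6894,CL:0000223")
for ids in observations:
    print(ids)
```

Keeping the two levels apart in the result is what prevents the misreading described above, since an anatomy ID is only ever paired with the cell type from its own group.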

Advantages of the simple scheme

Advantages of the simple EXT scheme:

  • Simple for GAF providers
  • Simple for GAF consumers

Disadvantages:

  • Lack of expressivity

The main disadvantage of this scheme is that we are not expressing what the relationship is between the MF/BP/CC term referenced in col 5 and the entity in the EXT column. This is less of a problem for humans, who are skilled at guessing the context. It is a problem for computers.

Expressive Solution

The more expressive solution allows us to specify the relationship between col 5 and col 16.

The basic syntax we would use would be:

 RELATION '(' OBO-ID ')'

The relation would be drawn from the OBO relation ontology.

Note that this syntax is fully generalisable to all kinds of cross products.

For example:

 part_of(CL:0000017)

CL:0000017 is the CL ID for "spermatocyte". If the GO term in the annotation was for "nucleus", then the overall meaning of the annotation would be "a nucleus that is in a spermatocyte".

BP-MF Example (with relations)

pseudo-GAF (the parts after the ! would not be in the actual file, we are just including them here to make the examples readable)

Here is gene 1234 that executes GTPase activity as part of an intracellular signaling cascade:

 Gene (col 2/3)  Term (col 5)                                  Ref (col 6)  Ext (col 16)
 Gene1234        GO:0003924 ! GTPase activity                  PMID:nnnn    part_of(GO:0007242) ! intracellular signaling cascade
 Gene1234        GO:0007242 ! intracellular signaling cascade  PMID:nnnn    (empty)
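
With rows like these, software can recover which function goes with which process, which is the question raised in the co-annotation use case. The sketch below is illustrative only: `functions_by_process` is a hypothetical helper, rows are simplified to (gene, term ID, EXT) tuples, and the "(empty)" cell is represented by an empty string.

```python
import re
from collections import defaultdict

# Matches part_of(ID) and captures the ID inside the parentheses.
PART_OF = re.compile(r"\bpart_of\(([^()\s]+)\)")


def functions_by_process(rows):
    """Map each process ID to the (gene, function ID) pairs that are part_of it."""
    context = defaultdict(set)
    for gene, term_id, ext in rows:
        for process_id in PART_OF.findall(ext):
            context[process_id].add((gene, term_id))
    return dict(context)


rows = [
    ("Gene1234", "GO:0003924", "part_of(GO:0007242)"),
    ("Gene1234", "GO:0007242", ""),
]
print(functions_by_process(rows))
```

The plain process annotation in the second row contributes nothing to the map; only the extended function annotation says which process the function is executed in.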

CC-CL Example (with relations)

pseudo-GAF (the parts after the ! would not be in the actual file, we are just including them here to make the examples readable)

Here is an imaginary gene localized to the mitochondrial membrane in a spermatocyte:

 Gene (col 2/3)  Term (col 5)                         Ref (col 6)  Ext (col 16)
 Gene1234        GO:0031966 ! mitochondrial membrane  PMID:nnnn    part_of(CL:0000017) ! spermatocyte

BP x anatomy example (with relations)

Example of a gene product executing its function in a particular location. Here we use the unfolds_in relation:

 Gene (col 2/3)  Term (col 5)                  Ref (col 6)  Ext (col 16)
 CREB            GO:0006094 ! gluconeogenesis  PMID:nnnn    unfolds_in(MA:0000358) ! liver


TLR example (with relations)

  • Toll-like receptor 4 (TLR4) (O00206) is located intracellularly in the perinuclear region (GO:0048471) only in immature DC, PMID:15027902
  • TLR4 is located on the cell surface (GO:0005887) in monocytes, PMID:15027902


 Gene (col 2/3)  Term (col 5)                     Ref (col 6)    Ext (col 16)
 TLR4 O00206     perinuclear region (GO:0048471)  PMID:15027902  part_of(CL:new) ! immature dendritic cell
 TLR4 O00206     cell surface (GO:0005887)        PMID:15027902  part_of(CL:0000576) ! monocyte

Multiple localizations example (with relations)

What if the publication describes separate observations - perhaps one for bipolar neuron and one for Purkinje cell?

We can separate these using |. This is equivalent to splitting the annotation over two lines. For example:


 Gene (col 2/3)  Term (col 5)                         Ref (col 6)  Ext (col 16)
 Gene1234        GO:0031966 ! mitochondrial membrane  PMID:nnnn    part_of(CL:0000121) PIPE part_of(CL:0000103) ! bipolar neuron & Purkinje cell

The "|" separator indicates that this is a separate localization of a different instance of this gene product.


(I can't figure out how to include a pipe in a wiki table so I just wrote PIPE)


What if we want to annotate two separate observations of the same subcellular localization - one from an astrocyte of the hippocampus, the other from a B cell in the lymph?

We use the "," to indicate an additional extension for the same observation. So the above would be:

 Gene (col 2/3)  Term (col 5)                         Ref (col 6)  Ext (col 16)
 Gene1234        GO:0031966 ! mitochondrial membrane  PMID:nnnn    part_of(CL:0000127),part_of(MA:0000953) PIPE part_of(CL:0000236),part_of(MA:0002520) ! one from an astrocyte of the hippocampus, the other from a B cell in the lymph

This would be equivalent to two annotations:

 Gene (col 2/3)  Term (col 5)                         Ref (col 6)  Ext (col 16)
 Gene1234        GO:0031966 ! mitochondrial membrane  PMID:nnnn    part_of(CL:0000127),part_of(MA:0000953) ! astrocyte of the hippocampus
 Gene1234        GO:0031966 ! mitochondrial membrane  PMID:nnnn    part_of(CL:0000236),part_of(MA:0002520) ! a B cell in the lymph
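
The equivalence between one row with "|" and several single-observation rows can be sketched as a small expansion step. This is an assumption-laden illustration: `split_observations` is a hypothetical name, a row is reduced to a (gene, term, ref, ext) tuple, and the real file is assumed to hold a literal "|" where the tables on this page say PIPE.

```python
def split_observations(row):
    """Expand one pseudo-GAF row into one row per '|'-separated extension group."""
    gene, term, ref, ext = row
    # A row without extensions stays a single row with an empty EXT value.
    groups = [g.strip() for g in ext.split("|")] if ext.strip() else [""]
    return [(gene, term, ref, group) for group in groups]


combined = (
    "Gene1234",
    "GO:0031966",
    "PMID:nnnn",
    "part_of(CL:0000127),part_of(MA:0000953)|part_of(CL:0000236),part_of(MA:0002520)",
)
for expanded in split_observations(combined):
    print(expanded)
```

The ","-joined properties stay together in each expanded row, so the astrocyte remains tied to the hippocampus and the B cell to the lymph.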

Here is another, real-life example from MGI:


 Gene (col 2/3)       Term (col 5)                         Ref (col 6)  Ext (col 16)
 MGI:1919277 Slc39a4  GO:0016324 ! apical plasma membrane  PMID:nnnn    part_of(MA:0000337),part_of(CL:0000584) ! enterocyte of small intestine
 MGI:1919277 Slc39a4  GO:0016324 ! apical plasma membrane  PMID:nnnn    part_of(EMAP:6894),part_of(CL:0000223) ! endodermal cell of TS22\,extraembryonic component

Response to drug (with relations)

There are different options for "response to cocaine as drug".

Option 1:

 Gene (col 2/3)       Term (col 5)                      Ref (col 6)  Ext (col 16)
 moody (FBgn0025631)  GO:0042220 ! response to cocaine  PMID:nnnn    response_to(CHEBI:23888) ! drug

Here we need a new relation, "response_to".

Another solution:

 Gene (col 2/3)       Term (col 5)                      Ref (col 6)  Ext (col 16)
 moody (FBgn0025631)  GO:0042220 ! response to cocaine  PMID:nnnn    is_a(GO:0042493) ! response to drug


Implementation Plan

  1. Test annotation files will be made available to Berkeley (contributors: MGI, GOA, Dicty...?) with col16 populated
  2. Berkeley will populate a test database (Seth)
  3. Toy version of AmiGO with CL IDs queryable
  4. Change schema of production db
  5. Officially add spec for col16
  6. Annotation contributors start adding columns
  7. CL populated and queryable in public AmiGO
  8. Extend scheme to other OBO ontologies

The toy version of AmiGO should be ready by the GO meeting.

Database Implementation

See SWUG:Database


FAQ

Will this replace existing combinatorial GO terms like "B cell differentiation"?

No! It is important to keep terms like this pre-coordinated in the GO.

When do I request a new term and when do I use the properties column?

Request a new term if it seems like a sensible new term to have in GO. A combinatorial term in GO is fine if it corresponds to a commonly used scientific term and the combination is not completely arbitrary or accidental.

Appendix

Grammar for col 16

This is specified as a BNF grammar. This is necessary to keep the field extensible enough for future use. Note that the column is optional, so there is no requirement for people to parse it. It is an 'added bonus' column.

 PropertiesSet := Properties | Properties "|" PropertiesSet
 Properties := Property | Property ',' Properties
 Property := Relation '(' Term ')'
 Term := ID
 Relation := Relation-Abbrev | ID
 ID := ID-Space ':' Local-ID
 ID-Space := XML-NMToken
 Local-ID := chars
 Relation-Abbrev := chars

Relations can be abbreviated; e.g., part_of can be used in place of OBO_REL:part_of.
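
A parser for the grammar as given so far (without nesting) might look like the sketch below. It is one possible reading, not an official implementation: the names `Property`, `expand_relation` and `parse_col16` are hypothetical, IDs are assumed to contain no whitespace, parentheses, commas or pipes, and an abbreviated relation is assumed to expand into the OBO_REL ID space.

```python
import re
from typing import List, NamedTuple


class Property(NamedTuple):
    relation: str  # full relation ID, e.g. OBO_REL:part_of
    term: str      # OBO ID of the extension term


# Property := Relation '(' Term ')', where Term is ID-Space ':' Local-ID.
_PROPERTY = re.compile(r"([^\s(),|]+)\(([^\s(),|:]+:[^\s(),|]+)\)")


def expand_relation(relation: str, default_space: str = "OBO_REL") -> str:
    """Expand an abbreviated relation such as part_of to OBO_REL:part_of."""
    return relation if ":" in relation else f"{default_space}:{relation}"


def parse_col16(value: str) -> List[List[Property]]:
    """Parse a non-nested col 16 value into a PropertiesSet (list of Properties)."""
    if not value.strip():
        return []  # the column is optional
    result = []
    for group in value.split("|"):        # PropertiesSet
        properties = []
        for item in group.split(","):     # Properties
            match = _PROPERTY.fullmatch(item.strip())
            if match is None:
                raise ValueError(f"malformed property: {item!r}")
            properties.append(
                Property(expand_relation(match.group(1)), match.group(2))
            )
        result.append(properties)
    return result
```

Consumers that do not need the extensions can skip this step entirely, in line with the column being optional.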

This can be extended to allow for nested expressions:

 Term := ID | ID '^' Properties