manastech / cafa5

Function to download protein known GO terms from uniprot

leandroradusky opened this issue

In order to validate our models, we should be able to look up the GO terms actually assigned to a given UniProt accession and check the probabilities our models assign to them.

UniProt offers, for each entry, an XML file with all the information available for it. It can be accessed at https://rest.uniprot.org/uniprotkb/<PROT_ACCESSION>.xml

For example, P68510, one of the proteins in the competition's test superset, already has plenty of GO terms assigned. Here is the page displaying all the information contained in the XML, nicely formatted and linked to each resource.

Our package should provide functionality that, given a UniProt accession, returns the list of GO terms available for it. When we start to incorporate more information into our models, we will use these XML files to access other resources (the protein sequence, its structure, etc.).

It would therefore be good to have a Protein class in our package to access this data. We should be able to do:

from manas_cafa5 import Protein

my_protein = Protein('P68510')

And then my_protein.go_terms() should return a list containing all the Gene Ontology identifiers included in the XML file.
In the future, we will access other data, such as the protein's structure identifiers, and we will also have a Structure class that loads the structure, encodes it to feed our models, etc.
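
As a starting point for discussion, here is a minimal sketch of what such a Protein class could look like, using only the Python standard library. It is not the package's actual implementation: the helper parse_go_terms, the constant UNIPROT_XML_URL, and the lazy caching of the XML are all hypothetical choices. It also assumes that GO annotations appear in the XML as dbReference elements with type="GO" and the identifier in the id attribute, which should be checked against a real entry such as P68510.

```python
import urllib.request
import xml.etree.ElementTree as ET

# URL pattern taken from the issue text; the constant name is hypothetical.
UNIPROT_XML_URL = "https://rest.uniprot.org/uniprotkb/{accession}.xml"


def parse_go_terms(xml_data):
    """Return the GO identifiers found in a UniProt entry XML document.

    Assumes GO annotations are <dbReference type="GO" id="GO:..."> elements.
    Accepts str or bytes; duplicates are dropped, first-seen order is kept.
    """
    root = ET.fromstring(xml_data)
    go_ids = []
    for element in root.iter():
        # Drop the "{namespace}" prefix so matching does not depend on it.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "dbReference" and element.get("type") == "GO":
            go_id = element.get("id")
            if go_id and go_id not in go_ids:
                go_ids.append(go_id)
    return go_ids


class Protein:
    """Hypothetical wrapper around a UniProt entry, keyed by accession."""

    def __init__(self, accession):
        self.accession = accession
        self._xml = None  # filled on first access, then reused

    def xml(self):
        """Download the entry XML once and cache the raw bytes."""
        if self._xml is None:
            url = UNIPROT_XML_URL.format(accession=self.accession)
            with urllib.request.urlopen(url, timeout=30) as response:
                self._xml = response.read()
        return self._xml

    def go_terms(self):
        """Return the list of GO identifiers annotated on this entry."""
        return parse_go_terms(self.xml())
```

Keeping the parsing in a standalone function means it can be tested on a saved XML file without network access, and the cached XML can later be reused to extract the sequence or structure identifiers.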