Function to download a protein's known GO terms from UniProt
leandroradusky opened this issue · comments
To validate our models, we need to be able to look up the GO terms actually assigned to a given UniProt accession and check the probabilities our models assign to them.
The UniProt resource offers, for each entry, an XML file with all the information available for it. It can be accessed at https://rest.uniprot.org/uniprotkb/<PROT_ACCESSION>.xml
For example, P68510, one of the proteins in the competition's test superset, already has plenty of GO terms assigned. Its UniProt entry page displays all the information contained in the XML, nicely formatted and linked to each resource.
Our package should provide a function that, given a UniProt accession, returns the list of GO terms annotated for it. When we start to incorporate more information into our models, we will use these XML files to access other data as well (the protein sequence, its structure, etc.).
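A minimal sketch of such a function, using only the Python standard library. The function names (`fetch_uniprot_xml`, `extract_go_terms`) are hypothetical, and the parsing assumes that GO annotations appear in the XML as `dbReference` elements with `type="GO"` and the identifier in the `id` attribute; this should be checked against a real entry before relying on it.

```python
from __future__ import annotations

import urllib.request
import xml.etree.ElementTree as ET

# URL pattern taken from the issue text above.
UNIPROT_XML_URL = "https://rest.uniprot.org/uniprotkb/{accession}.xml"


def fetch_uniprot_xml(accession: str, timeout: float = 30.0) -> bytes:
    """Download the raw XML record for one accession (needs network access)."""
    url = UNIPROT_XML_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_go_terms(xml_data: str | bytes) -> list[str]:
    """Return the unique GO identifiers in a UniProt XML record, in document order."""
    root = ET.fromstring(xml_data)
    go_terms: list[str] = []
    for element in root.iter():
        # Strip the "{namespace}" prefix so the check does not depend on
        # the exact namespace URI declared by the file.
        local_name = element.tag.rsplit("}", 1)[-1]
        # Assumption: GO annotations are <dbReference type="GO" id="GO:..."/>.
        if local_name == "dbReference" and element.get("type") == "GO":
            go_id = element.get("id")
            if go_id and go_id not in go_terms:
                go_terms.append(go_id)
    return go_terms
```

Calling `extract_go_terms(fetch_uniprot_xml("P68510"))` would then give the identifiers to compare against the model's predictions. Keeping the download and the parsing separate means the parser can be unit-tested on a saved XML file without network access.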
It would then be good to have a Protein class in our package to access this data. We should be able to do:
from manas_cafa5 import Protein
my_protein = Protein('P68510')
Then my_protein.go_terms() should return a list containing all the Gene Ontology identifiers included in the XML file.
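One possible shape for this class is sketched below. It is not an agreed design: the `fetch` parameter, the `_download_xml` helper and the lazy caching are assumptions added for illustration, and the parsing again assumes GO annotations are stored as `dbReference` elements with `type="GO"`.

```python
from __future__ import annotations

import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Optional


def _download_xml(accession: str) -> bytes:
    """Default fetcher: download the entry's XML from the UniProt REST endpoint."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.xml"
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


class Protein:
    """Hypothetical wrapper around one UniProt entry."""

    def __init__(self, accession: str, fetch: Callable[[str], bytes] = _download_xml):
        self.accession = accession
        self._fetch = fetch  # injectable so tests can avoid the network
        self._root: Optional[ET.Element] = None

    def _xml(self) -> ET.Element:
        # Download and parse on first use, then reuse the parsed tree.
        if self._root is None:
            self._root = ET.fromstring(self._fetch(self.accession))
        return self._root

    def go_terms(self) -> list[str]:
        """Return the unique GO identifiers of the entry, in document order."""
        found: dict[str, None] = {}  # dict keeps insertion order and drops repeats
        for element in self._xml().iter():
            # Compare the tag without its "{namespace}" prefix.
            is_db_reference = element.tag.rsplit("}", 1)[-1] == "dbReference"
            if is_db_reference and element.get("type") == "GO":
                go_id = element.get("id")
                if go_id:
                    found.setdefault(go_id, None)
        return list(found)
```

With this sketch, `Protein('P68510').go_terms()` triggers a single download, and later accessors for the sequence or structure identifiers could reuse the same cached XML tree.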
In the future, we will access other data, such as the protein's structure identifiers, and we will also have a Structure class that loads the structure, encodes it to feed our models, etc.