HPOFiller

HPOFiller: identifying missing protein-phenotype associations by graph convolutional network

Please run the programs in order.

Dependencies

Our model is implemented in Python 3.6 with PyTorch 1.4.0 and PyTorch Geometric 1.5.0, and was run on an NVIDIA GPU with CUDA 10.0.

extract_gene_id.py: First, download the gene annotation file with all sources and all frequencies, ALL_SOURCES_ALL_FREQUENCIES_genes_to_phenotype.txt, from http://compbio.charite.de/jenkins/job/hpo.annotations.monthly/. Then run the script to get a .txt file containing all gene IDs. Finally, upload this file to http://www.uniprot.org/mapping/ to map Entrez Gene IDs to UniProt IDs.
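
The extraction step can be pictured with the minimal sketch below. It is not the code of extract_gene_id.py; the function name `extract_gene_ids` is hypothetical, and it assumes that the annotation file is tab-separated, that comment or header lines start with `#`, and that the Entrez Gene ID is the first column.

```python
def extract_gene_ids(lines):
    """Collect the unique Entrez Gene IDs found in annotation lines."""
    gene_ids = set()
    for line in lines:
        line = line.strip()
        # Assumption: header/comment lines start with '#'.
        if not line or line.startswith("#"):
            continue
        # Assumption: the Entrez Gene ID is the first tab-separated column.
        gene_ids.add(line.split("\t")[0])
    return sorted(gene_ids)


# Toy lines that mimic the assumed layout (not real file content).
sample_lines = [
    "#Format: entrez-gene-id\tentrez-gene-symbol\tHPO-Term-Name\tHPO-Term-ID",
    "8192\tCLPP\tSome phenotype\tHP:0000013",
    "8192\tCLPP\tAnother phenotype\tHP:0000007",
    "2\tA2M\tSome phenotype\tHP:0000707",
]
gene_ids = extract_gene_ids(sample_lines)
gene_id_text = "\n".join(gene_ids)  # one ID per line, ready to write to a .txt file
```

With a real file, the same function could be called on an open file handle, since it only iterates over lines.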

create_annotation.py: After generating the gene ID mapping file, run this script to generate the protein-HPO annotation file without propagation. The output JSON file contains the leaf annotations of each protein, for example:
```
 { protein_id1: [ hpo_term1, hpo_term2, ... ],
   protein_id2: [ hpo_term1, hpo_term2, ... ],
  ...
 }
```
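
As a rough illustration of how a file with this shape could be assembled, the sketch below joins gene-to-term pairs with a gene-to-UniProt mapping. The function name `build_annotations` and the input structures are assumptions made for this example, not the interface of create_annotation.py, and the sketch does not remove ancestor terms.

```python
import json
from collections import defaultdict


def build_annotations(gene_term_pairs, gene_to_uniprot):
    """Map each UniProt ID to the HPO terms annotated to its gene(s)."""
    annotations = defaultdict(set)
    for gene_id, hpo_term in gene_term_pairs:
        # Genes without a UniProt mapping are skipped (an assumption).
        for protein_id in gene_to_uniprot.get(gene_id, []):
            annotations[protein_id].add(hpo_term)
    # Sort for a stable, JSON-serializable result.
    return {p: sorted(terms) for p, terms in sorted(annotations.items())}


# Toy inputs; the IDs are placeholders for illustration only.
pairs = [("2", "HP:0000007"), ("2", "HP:0000013"), ("9", "HP:0000007")]
mapping = {"2": ["P01023"]}
annotations = build_annotations(pairs, mapping)
annotation_json = json.dumps(annotations, indent=2)
```

The resulting string can be written to disk and later read back with `json.load`.
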
create_auxiliary_file_pa.py: Next, create the necessary auxiliary files:
- protein list: a JSON file containing all protein IDs
- term list: a JSON file containing all HPO terms used to annotate at least one protein. Note that we only keep HPO terms in the PA sub-ontology.
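
The two lists above could be derived from the annotation file roughly as follows. This is a sketch under assumptions: `parents` is a hypothetical dictionary from each HPO term to its direct parents, PA is taken to be the sub-ontology rooted at HP:0000118 (Phenotypic abnormality), and only strict descendants of that root are kept.

```python
PA_ROOT = "HP:0000118"  # root of the Phenotypic abnormality sub-ontology


def ancestors(term, parents):
    """Return all ancestors of a term by walking the parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_lists(annotations, parents):
    """Build the protein list and the PA-only term list."""
    protein_list = sorted(annotations)
    used_terms = {t for terms in annotations.values() for t in terms}
    # Keep only terms that have the PA root among their ancestors.
    term_list = sorted(t for t in used_terms if PA_ROOT in ancestors(t, parents))
    return protein_list, term_list


# Toy ontology fragment and annotations, for illustration only.
toy_parents = {
    "HP:0000118": ["HP:0000001"],
    "HP:0000005": ["HP:0000001"],
    "HP:0000707": ["HP:0000118"],
    "HP:0000007": ["HP:0000005"],
}
toy_annotations = {"P2": ["HP:0000707", "HP:0000007"], "P1": ["HP:0000707"]}
protein_list, term_list = build_lists(toy_annotations, toy_parents)
```

In this toy example the term under HP:0000005 is dropped because it does not descend from the PA root.
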

split_train_test_pa.py: Run this script to split the data into n_folds folds and generate n_folds mask files, each containing a train mask and a test mask.
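
One way to picture the fold split is the sketch below. It only partitions the known (protein, term) pairs and represents each mask as a list of pairs; the actual mask format written by split_train_test_pa.py may differ, and the function name `split_folds` and the `seed` parameter are assumptions.

```python
import random


def split_folds(pairs, n_folds, seed=0):
    """Split known associations into n_folds train/test partitions."""
    shuffled = sorted(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    folds = []
    for k in range(n_folds):
        # Every n_folds-th pair (offset k) goes to the test mask of fold k.
        test = [p for i, p in enumerate(shuffled) if i % n_folds == k]
        train = [p for i, p in enumerate(shuffled) if i % n_folds != k]
        folds.append({"train": train, "test": test})
    return folds


# Toy associations: 10 placeholder (protein, term) pairs.
toy_pairs = [("P%d" % i, "HP:%07d" % i) for i in range(10)]
folds = split_folds(toy_pairs, n_folds=5)
```

Each pair appears in exactly one test mask, so the n_folds test masks together cover all known associations.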

split_temporal_dataset_pa.py: Build the datasets needed for temporal validation from annotations released at different times. Note that we only consider HPO terms in PA.
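
A temporal split is commonly built by treating the associations of an earlier release as training data and the associations that appear only in a later release as test data. The sketch below follows that idea; it is an assumption about the general approach, not a description of split_temporal_dataset_pa.py, and `temporal_split` is a hypothetical name.

```python
def to_pairs(annotations):
    """Flatten {protein: [terms]} into a set of (protein, term) pairs."""
    return {(p, t) for p, terms in annotations.items() for t in terms}


def temporal_split(old_annotations, new_annotations):
    """Train on the older release, test on associations added later."""
    train = to_pairs(old_annotations)
    test = to_pairs(new_annotations) - train
    return train, test


# Toy releases with placeholder IDs.
old_release = {"P1": ["HP:0000707"]}
new_release = {"P1": ["HP:0000707", "HP:0001250"], "P2": ["HP:0000707"]}
train_pairs, test_pairs = temporal_split(old_release, new_release)
```

Here the test set contains only the two associations that are absent from the older release.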

Here the scores are scaled to [0, 1].

train.py: We provide two modes:
- Cross-validation: The program conducts 10-fold CV and outputs the corresponding predictions. Set 'mode' in the config file to 'cv'.
- Temporal validation: This simulates a real-world scenario. The model predicts missing protein-HPO term associations based on the current HPO annotations. Set 'mode' in the config file to 'single'.

We have uploaded the predictions made by HPOFiller for the HPO annotations released on 2019-02-12. The data is available at:

The file is large (585.8 MB) and contains the rank, the UniProt ID of the protein, the HPO term ID and the predicted score. You are free to download it.