stamakro / GCN-for-Structure-and-Function

Code for reproducing results of "Unsupervised embeddings is all you need for protein function prediction"

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

This repository contains the source code and data needed to reproduce the results of the paper by Villegas-Morcillo et al.: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa701/5892762 Previous version (pre-print): https://www.biorxiv.org/content/10.1101/2020.04.07.028373v1

Datasets

For each dataset (PDB, SP and CAFA) there is a data_* directory with the training/validation/test protein IDs, the information content (IC) vector and the MFO GO term matrix. All the models_* directories and the feats_pdb for the PDB dataset are available in the 4TU.Centre for Research Data (https://data.4tu.nl/) repository:

https://doi.org/10.4121/uuid:b88d84e1-7408-40d9-a8fc-d734f852dd7a

The SP and CAFA features are not provided due to their large size. They can be generated using the example code or provided upon request.

Main dependencies

The pip environment in which the code was tested can be found in requirements.txt. The main dependencies are:

  • Python 3.6
  • Pytorch 1.2.0
  • Pytorch-geometric 1.3.1
  • Numpy 1.16.4
  • Scikit-learn 0.21.2
  • Biopython 1.74

Feature generation example

The code in Run_example.sh generates a feature dictionary for each protein sample. It contains the protein sequence, the amino acid-level ELMo embeddings, the MFO GO term labels and the protein contact map (only for the PDB dataset).

Neural network training and test example

Train the MLP_E model in the PDB dataset:

python scripts/main.py --phase='train' \
--batch_size=64 --num_epochs=100 --init_lr=0.0005 --lr_sched='True' \
--net_type='mlp' --feats_type='embeddings' --input_dim=1024 --fc_dim=512 \
--num_classes=256 --model_dir=models_pdb/MLP_E \
--train_file=datasets/data_pdb/train.names --valid_file=datasets/data_pdb/valid.names \
--feats_dir=feats_pdb --icvec_file=datasets/data_pdb/icVec.npy

Test the trained MLP_E model in the PDB dataset:

python scripts/main.py --phase='test' \
--net_type='mlp' --feats_type='embeddings' --input_dim=1024 --fc_dim=512 \
--num_classes=256 --feats_dir=feats_pdb --icvec_file=datasets/data_pdb/icVec.npy \
--model_file=models_pdb/MLP_E/model.pth.tar --test_file=datasets/data_pdb/test.names \
--save_file=models_pdb/MLP_E/test_pred.pkl

About

Code for reproducing results of "Unsupervised embeddings is all you need for protein function prediction"


Languages

Language:Python 93.5%Language:Shell 6.5%