GraphQA: Protein Model Quality Assessment using Graph Convolutional Networks

Evaluation server

Try it yourself! A simple implementation of an evaluation server is available at this link.

Initial setup

Clone the repository, install the dependencies in a conda environment, then install GraphQA:

git clone https://github.com/baldassarreFe/graphqa
cd graphqa

export PATH="/usr/local/cuda/bin:${PATH}"
export CPATH="/usr/local/cuda/include:${CPATH}"
conda env create -n graphqa -f conda.yaml
conda activate graphqa
pip install .

Prediction

To make predictions using GraphQA, follow the instructions in predictions.md.

Datasets

Manual download and preprocessing

The file notebooks/README.md contains all information to download and preprocess CASP data for training GraphQA. At a high level, the necessary steps are:

1. Download protein sequences, official native structures, submitted decoy structures, submitted QA predictions, and official QA scores from the CASP website.
2. Run DSSP on all submitted tertiary structures to extract secondary structure features.
3. Run JackHMMER on all protein sequences to compute multiple-sequence alignment features against UniRef50.
4. Score all decoys against their native structures, computing:
   - per-residue: CAD and LDDT scores
   - per-decoy: GDT_TS, GDT_HA, TM, CAD, LDDT scores
5. Transform each decoy into a graph data structure suitable for training with PyTorch, including all input and output features computed in the steps above. At this stage, geometric and sequential features (edges, distances, and angles) are also added to the graph so they do not have to be computed during training.
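
To make the last step more concrete, the sketch below lists residue pairs that lie closer than a distance cutoff, which is one plausible way to define spatial edges (the `cutoff` training parameter suggests something similar). It is not GraphQA's implementation: the toy input format (`index x y z` per line) and the helper name `edges_within_cutoff` are assumptions made for illustration only.

```shell
# Hypothetical helper: print "i j distance" for every residue pair closer than a cutoff.
# Input on stdin: one residue per line as "index x y z" (an assumed toy format).
edges_within_cutoff() {
  awk -v cutoff="$1" '
    { idx[NR] = $1; x[NR] = $2; y[NR] = $3; z[NR] = $4 }
    END {
      for (i = 1; i <= NR; i++)
        for (j = i + 1; j <= NR; j++) {
          dx = x[i] - x[j]; dy = y[i] - y[j]; dz = z[i] - z[j]
          d = sqrt(dx * dx + dy * dy + dz * dz)
          if (d < cutoff) printf "%s %s %.2f\n", idx[i], idx[j], d
        }
    }'
}

# Three toy residues; only the first two are within 10 units of each other.
printf '1 0 0 0\n2 3 4 0\n3 30 0 0\n' | edges_within_cutoff 10   # prints: 1 2 5.00
```

Computing such pairs once during preprocessing, rather than on every training batch, is the trade-off the step above describes.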

First, run the DownloadCaspData notebook to download raw protein data from the CASP website.

Then, prepare all preprocessing tools (some of them require a compilation step, others run in Docker):

# Docker image for DSSP
docker build -t dssp 'https://github.com/cmbi/dssp.git#697deab74011bfbd55891e9b8d5d47b8e4ef0e38'

# Sequence database for JackHMMER
wget 'ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz'
gunzip 'uniref50.fasta.gz'

# Docker image for LDDT score
docker pull 'registry.scicore.unibas.ch/schwede/openstructure:2.1.0'

# Voronota binaries for CAD score
wget 'https://github.com/kliment-olechnovic/voronota/releases/download/v1.21.2744/voronota_1.21.2744.tar.gz'
tar xzf 'voronota_1.21.2744.tar.gz'

# TMscore source for GDT_TS, GDT_HA, TM scores
wget 'https://zhanglab.ccmb.med.umich.edu/TM-score/TMscore.cpp'
g++ -static -O3 -ffast-math -lm -o TMscore TMscore.cpp
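
Before starting the long preprocessing run, it can help to confirm that the tools prepared above are where the next command expects them. The following is a minimal sketch; the helper name `require_files` is hypothetical and not part of GraphQA.

```shell
# Hypothetical helper: report every missing path on stderr and
# return non-zero if at least one path does not exist.
require_files() {
  missing=0
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "missing: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example, using the paths from the commands in this section:
# require_files ./TMscore ./voronota_1.21.2744/voronota-cadscore uniref50.fasta
```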

Run preprocessing for training:

for CASP in data/CASP{9..13}; do
  python -m graphqa.data.preprocess "$CASP" "uniref50.fasta" \
    --train \
    --tmscore "./TMscore" \
    --voronota "./voronota_1.21.2744/voronota-cadscore"
done

Download preprocessed data

Downloading the data and running the preprocessing steps described above can take a long time. To skip these steps and directly download the dataset used for training:

BASE_URL='https://kth.box.com/shared/static/'
wget -O GraphQA-CASP9.tar.gz  "${BASE_URL}fm2weje86d7nvulbconzf3pzmmhl2tmm.gz"
wget -O GraphQA-CASP10.tar.gz "${BASE_URL}jdgns10ehenjur1y5dw2lj275aggeu33.gz"
wget -O GraphQA-CASP11.tar.gz "${BASE_URL}tls5yxhsycqpid8pp6i3jv7ew7h0xz6l.gz"
wget -O GraphQA-CASP12.tar.gz "${BASE_URL}cbm3k5ladnq5i42q5fdcbztxwaukde9x.gz"
wget -O GraphQA-CASP13.tar.gz "${BASE_URL}f66fjw67urwxcovfrpar5jd4diyayshl.gz"

Extract the contents of each tar archive into the corresponding folder under data/ (for example, data/CASP9).
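
The extraction can be scripted along these lines. This is a sketch under one assumption that is not confirmed here: that each archive has no top-level CASP folder of its own (inspect it with `tar tzf` first and adjust the target if it does). The helper name `extract_casp_archives` is hypothetical.

```shell
# Hypothetical helper: unpack GraphQA-CASP<N>.tar.gz from the current
# directory into <data_root>/CASP<N>, stopping at the first failure.
# Assumption: archives do not already contain a top-level CASP<N> folder.
extract_casp_archives() {
  data_root="$1"
  shift
  for n in "$@"; do
    mkdir -p "${data_root}/CASP${n}" || return 1
    tar xzf "GraphQA-CASP${n}.tar.gz" -C "${data_root}/CASP${n}" || return 1
  done
}

# Example: extract_casp_archives data 9 10 11 12 13
```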

Training

Either train with a predefined configuration:

python -m proteins.train config/train.yaml --model config/model.yaml --session config/session.yaml [in_memory=yes]

Or define all parameters manually:

# Data
cutoff=10
partial_entropy=no
self_information=no
dssp=no

# Model
model_fn=proteins.networks.ProteinGN
layers=6
min_dist=0
max_dist=20
rbf_size=16
residue_emb_size=64
separation_enc=categorical
distance_enc=rbf
mp_in_edges=128
mp_in_nodes=512
mp_in_globals=512
mp_out_edges=16
mp_out_nodes=64
mp_out_globals=32
dropout=.2
batch_norm=no

# Losses
loss_local_lddt=5
loss_global_gdtts=5

# Optimizer
opt_fn=torch.optim.Adam
learning_rate=.001
weight_decay=.00001

# Session
max_epochs=10
batch_size=1000
datasets='[data/CASP7,data/CASP8,data/CASP9,data/CASP10]'
logs='~/proteins/runs'

tags=()
tags+=("residueonly")
tags+=("l${layers}")
tags+=("${mp_in_edges}-${mp_in_nodes}-${mp_in_globals}")
tags+=("${mp_out_edges}-${mp_out_nodes}-${mp_out_globals}")
tags+=("dr${dropout}")
tags+=("bn${batch_norm}")
tags+=("lr${learning_rate}")
tags+=("wd${weight_decay}")
tags+=("ll${loss_local_lddt}")
tags+=("lg${loss_global_gdtts}")
tags+=("co${cutoff}")
tags+=("res${residue_emb_size}")
tags+=("rbf${rbf_size}")
tags+=("sep${separation_enc}")
tags+=("dist${distance_enc}")
tags="[$(IFS=, ; echo "${tags[*]}")]"

python -m proteins.train \
    tags="${tags}" \
    --data \
        cutoff="${cutoff}" \
        partial_entropy="${partial_entropy}" \
        self_information="${self_information}" \
        dssp="${dssp}" \
    --model \
        fn="${model_fn}" \
        layers="${layers}" \
        dropout="${dropout}" \
        batch_norm="${batch_norm}" \
        min_dist="${min_dist}" \
        max_dist="${max_dist}" \
        rbf_size="${rbf_size}" \
        residue_emb_size="${residue_emb_size}" \
        separation_enc="${separation_enc}" \
        distance_enc="${distance_enc}" \
        mp_in_edges="${mp_in_edges}" \
        mp_in_nodes="${mp_in_nodes}" \
        mp_in_globals="${mp_in_globals}" \
        mp_out_edges="${mp_out_edges}" \
        mp_out_nodes="${mp_out_nodes}" \
        mp_out_globals="${mp_out_globals}" \
    --loss.local_lddt \
        name=mse \
        weight="${loss_local_lddt}" \
    --loss.global_gdtts \
        name=mse \
        weight="${loss_global_gdtts}" \
    --optimizer \
        fn="${opt_fn}" \
        lr="${learning_rate}" \
        weight_decay="${weight_decay}" \
    --session.data \
        trainval="${datasets}" \
        split=35 \
        in_memory=yes \
    --session.logs \
        folder="${logs}" \
    --session \
        cpus=1 \
        checkpoint=2 \
        max_epochs="${max_epochs}" \
        batch_size="${batch_size}"
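
The `tags=` line in the script above joins the collected tags into one bracketed, comma-separated string by setting `IFS` inside a subshell. The same idea in isolation, as a small sketch (the function name `join_tags` is hypothetical and not part of GraphQA):

```shell
# Join all arguments with commas and wrap them in brackets.
# The function body runs in a subshell, so the caller's IFS is left untouched.
join_tags() (
  IFS=,
  printf '[%s]\n' "$*"
)

join_tags residueonly l6 dr.2   # prints: [residueonly,l6,dr.2]
```

Printing the joined string like this before launching a run is a quick way to check that the tags reflect the parameters that were actually set.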

Logs and checkpoints are written to the runs folder and can be inspected with TensorBoard:

tensorboard --logdir runs

Ablation studies

Config files for ablation studies are self-contained and can be run directly:

NUM_RUNS_PER_STUDY=5
for f in config/ablations/{nodes,edges,layersvscutoff,architecture,localglobalscore,separation_encoding}/*.yaml; do
    for i in $(seq ${NUM_RUNS_PER_STUDY}); do
        python -m proteins.train "${f}"
    done
done
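
Since the loop above launches many training runs, a dry run that only prints the commands can be useful first. The sketch below is an illustration, not part of GraphQA: `plan_ablation_runs` is a hypothetical helper, and unlike the loop above it walks every study folder under the given root rather than the six named ones.

```shell
# Hypothetical helper: print one training command per config file and
# repetition, without running anything.
plan_ablation_runs() {
  config_root="$1"
  runs_per_study="$2"
  for f in "$config_root"/*/*.yaml; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    i=1
    while [ "$i" -le "$runs_per_study" ]; do
      echo "python -m proteins.train \"$f\""
      i=$((i + 1))
    done
  done
}

# Example: plan_ablation_runs config/ablations 5
```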

Testing

Test GraphQA with all features (residues, multiple-sequence alignment, DSSP):

RUN_PATH='runs/l6_128-512-512_16-64-32_res64_rbf32_sepcategorical_dr.2_bnno_lr.001_wd.00001_ll1_lg1_lr0_co8_allfeats_wonderful_mclean'
for data in $(find 'data/' -maxdepth 1 -mindepth 1 -type d); do
    python -m proteins.test \
      "${RUN_PATH}/experiment.latest.yaml" \
      --model state_dict="${RUN_PATH}/model.latest.pt" \
      --test \
        data.input="${data}" \
        data.output="results/allfeatures/$(basename "${data}")" \
        data.in_memory=yes \
        cpus=1 \
        batch_size=200
done

Test GraphQA with residue identity features only:

RUN_PATH='runs/residueonly_l8_128-512-512_16-64-64_dr.1_bnno_lr.001_wd.00001_ll1_ll5_co8_priceless_hawking'
for data in $(find 'data/' -maxdepth 1 -mindepth 1 -type d); do
    python -m proteins.test \
      "${RUN_PATH}/experiment.latest.yaml" \
      --model state_dict="${RUN_PATH}/model.latest.pt" \
      --test \
        data.input="${data}" \
        data.output="results/residueonly/$(basename "${data}")" \
        data.in_memory=yes \
        cpus=1 \
        batch_size=200
done
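
Both test loops derive the output folder from the name of each dataset folder. To preview that mapping without loading a model, something like the following sketch can be used; `list_test_outputs` is a hypothetical helper that only mirrors the `find` and `basename` calls from the loops above.

```shell
# Hypothetical helper: print "<input dir> -> <output dir>" for each
# dataset folder directly under data_root (regular files are ignored).
list_test_outputs() {
  data_root="$1"
  results_root="$2"
  find "$data_root" -maxdepth 1 -mindepth 1 -type d | sort | while IFS= read -r data; do
    echo "${data} -> ${results_root}/$(basename "${data}")"
  done
}

# Example: list_test_outputs data results/allfeatures
```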
