Reproducing independent test set results

zacps opened this issue 2 years ago

Zac Pullar-Strecker commented 2 years ago

Hi,

I've been trying to reproduce the bindPredict (DL only) results on the independent test set.

I'm using the provided checkpoints with the following changes (to configure paths):

Diff from master

diff --git a/config.py b/config.py
index 877f745..e39f969 100644
--- a/config.py
+++ b/config.py
@@ -11,13 +11,13 @@ class FileSetter(object):
 
     @staticmethod
     def embeddings_input():
-        return ''
+        return 'embeddings_independent/t5_embeddings/embeddings_converted.h5'
         # TODO set path to embeddings, this should be a .h5-file generated containing per-residue embeddings for all
         #  proteins with key: UniProt-ID, value: embeddings
 
     @staticmethod
     def predictions_folder():
-        return ''  # TODO set path to where predictions should be written
+        return 'predictions'  # TODO set path to where predictions should be written
 
     @staticmethod
     def profile_db():
@@ -58,11 +58,12 @@ class FileSetter(object):
 
     @staticmethod
     def test_ids_in():
-        return 'data/development_set/uniprot_test.txt'  # test ids used during development; available on GitHub
+        with open('data/independent_set/indep_set.txt') as f:
+            return [id_.strip() for id_ in f.readlines()]
 
     @staticmethod
     def fasta_file():
-        return 'data/development_set/all.fasta'  # path to development set; available on GitHub
+        return 'data/independent_set/indep_set.fasta'  # path to development set; available on GitHub
 
     @staticmethod
     def binding_residues_by_ligand(ligand):
diff --git a/data/development_set/uniprot_test.txt b/data/development_set/uniprot_test.txt
index 31dfd9d..3660c04 100644
--- a/data/development_set/uniprot_test.txt
+++ b/data/development_set/uniprot_test.txt
@@ -124,8 +124,6 @@ P50465
 P25524
 P46859
 Q9HAN9
-P84233
-P62799
 P06897
 P02281
 Q9HLQ2
diff --git a/develop_bindEmbed21DL.py b/develop_bindEmbed21DL.py
index c79cca9..a31790f 100644
--- a/develop_bindEmbed21DL.py
+++ b/develop_bindEmbed21DL.py
@@ -12,7 +12,7 @@ def main():
 
     keyword = sys.argv[1]
 
-    path = ''  # TODO set path to working directory
+    path = './trained_models'  # TODO set path to working directory
     Path(path).mkdir(parents=True, exist_ok=True)
 
     cutoff = 0.5

The embeddings were generated with bio_embeddings, though I suspect with a newer version than the one originally used, since I had to remove a couple of duplicated sequences for it to work.

These are the results of running develop_bindEmbed21DL.py with the testing keyword:

Prepare data
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
No residues annotated as binding for Q9LFM3
No residues annotated as binding for Q8A5J2
No residues annotated as binding for P43215
No residues annotated as binding for A0A0B0QJR1
No residues annotated as binding for Q9K943
No residues annotated as binding for P9WPA4
No residues annotated as binding for P0AG51
No residues annotated as binding for A0A0C2W6A5
No residues annotated as binding for Q9SJ89
No residues annotated as binding for Q9BTM1
No residues annotated as binding for Q8LGJ5
No residues annotated as binding for B8H4R9
No residues annotated as binding for P39230
No residues annotated as binding for Q03503
No residues annotated as binding for D0CAL0
No residues annotated as binding for E1C9L3
No residues annotated as binding for K5BJ73
No residues annotated as binding for F0Q4R9
No residues annotated as binding for O94408
No residues annotated as binding for Q58380
No residues annotated as binding for Q9H492
No residues annotated as binding for Q4KCZ1
No residues annotated as binding for Q9I0B9
No residues annotated as binding for Q9KDJ7
No residues annotated as binding for U2EQ00
No residues annotated as binding for P60624
No residues annotated as binding for F0NDX5
No residues annotated as binding for D0CBZ8
No residues annotated as binding for D0CD18
No residues annotated as binding for Q32904
No residues annotated as binding for A0A1Y1BWQ0
No residues annotated as binding for L7T0L4
No residues annotated as binding for Q2G285
No residues annotated as binding for Q709H6
No residues annotated as binding for X2D812
No residues annotated as binding for P17227
No residues annotated as binding for Q07654
No residues annotated as binding for D0CDQ7
overall
CovOneBind: 45 (1.000)
Bound: With predictions: 8, Without predictions: 0
Not Bound: With predictions: 37, Without predictions: 1
TP: 93, FP: 722, TN: 5104, FN: 187
Prec: 0.122 +/- 0.081, Recall: 0.066 +/- 0.051, F1: 0.072 +/- 0.051, MCC: 0.068 +/- 0.046, Acc: 0.817 +/- 0.055
metal
CovOneBind: 32 (1.000)
Bound: With predictions: 3, Without predictions: 0
Not Bound: With predictions: 29, Without predictions: 14
TP: 21, FP: 97, TN: 4425, FN: 23
Prec: 0.050 +/- 0.071, Recall: 0.042 +/- 0.058, F1: 0.045 +/- 0.063, MCC: 0.044 +/- 0.062, Acc: 0.971 +/- 0.009
nucleic
CovOneBind: 17 (1.000)
Bound: With predictions: 2, Without predictions: 0
Not Bound: With predictions: 15, Without predictions: 29
TP: 15, FP: 296, TN: 1542, FN: 16
Prec: 0.032 +/- 0.054, Recall: 0.050 +/- 0.094, F1: 0.038 +/- 0.068, MCC: 0.030 +/- 0.056, Acc: 0.768 +/- 0.143
small
CovOneBind: 34 (0.971)
Bound: With predictions: 6, Without predictions: 1
Not Bound: With predictions: 28, Without predictions: 11
TP: 54, FP: 393, TN: 4438, FN: 172
Prec: 0.112 +/- 0.089, Recall: 0.045 +/- 0.043, F1: 0.057 +/- 0.050, MCC: 0.053 +/- 0.046, Acc: 0.884 +/- 0.028

These are obviously much worse than the results reported in the paper. What have I done wrong?
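
One way to quantify the warnings in the log above is to collect the protein IDs they name. The output contains 38 such warnings, which matches the 37 + 1 proteins listed as Not Bound in the overall block. The following is a minimal sketch, assuming the log has been captured as text and that the warning format is exactly as printed above; the function name `unannotated_ids` is hypothetical and not part of the repository.

```python
import re

# Pattern copied from the warning lines printed in the log above.
WARNING = re.compile(r"^No residues annotated as binding for (\S+)$")


def unannotated_ids(log_text):
    """Return the protein IDs named in 'No residues annotated' warnings, in order."""
    ids = []
    for line in log_text.splitlines():
        match = WARNING.match(line.strip())
        if match:
            ids.append(match.group(1))
    return ids


# Short excerpt of the log above, used only to illustrate the call.
sample_log = """\
Number of input features: 1024
No residues annotated as binding for Q9LFM3
No residues annotated as binding for Q8A5J2
overall
"""

missing = unannotated_ids(sample_log)
print(len(missing), missing)
```

If most of the test proteins show up in this list, the annotation file and the test IDs probably do not belong to the same data set.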

Zac Pullar-Strecker · Answer 1 · Thu Oct 13 2022 04:07:27 GMT+0800 (China Standard Time)

I forgot to mention:

I've also retrained the model using develop_bindEmbed21DL.py best-training, which doesn't make a difference.

However, the cross-validation results it reports are similar to those in the original paper.

mariasche · Answer 2 · Sun Oct 23 2022 21:44:05 GMT+0800 (China Standard Time)

Hi,

I assume you are using the wrong annotation file. Change this line to return 'data/independent_set/binding_residues_2.5_{}.txt'.format(ligand) and download the respective file from this repository.
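
As a rough sketch of that change, assuming the line in question is the return statement of FileSetter.binding_residues_by_ligand in config.py (the method visible at the end of the config.py diff above), the edited method might look like this. The ligand names in the loop are taken from the section headers of the evaluation output and are an assumption about the values actually passed in.

```python
class FileSetter(object):

    @staticmethod
    def binding_residues_by_ligand(ligand):
        # Point at the independent-set annotations rather than the
        # development-set ones (path as given in the answer above).
        return 'data/independent_set/binding_residues_2.5_{}.txt'.format(ligand)


# Assumed ligand keys, based on the "metal", "nucleic" and "small" output blocks.
for ligand in ('metal', 'nucleic', 'small'):
    print(FileSetter.binding_residues_by_ligand(ligand))
```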

Let me know if it still doesn't work.

Zac Pullar-Strecker · Answer 3 · Mon Oct 24 2022 04:54:29 GMT+0800 (China Standard Time)

Thanks! I completely missed that configuration setting.

With that change, the results are similar to those in the paper.

Prepare data
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
overall
CovOneBind: 46 (1.000)
Bound: With predictions: 46, Without predictions: 0
Not Bound: With predictions: 0, Without predictions: 0
TP: 272, FP: 596, TN: 5056, FN: 303
Prec: 0.368 +/- 0.071, Recall: 0.600 +/- 0.099, F1: 0.371 +/- 0.060, MCC: 0.348 +/- 0.062, Acc: 0.822 +/- 0.054
metal
CovOneBind: 30 (0.938)
Bound: With predictions: 13, Without predictions: 2
Not Bound: With predictions: 17, Without predictions: 14
TP: 56, FP: 70, TN: 4464, FN: 21
Prec: 0.251 +/- 0.135, Recall: 0.281 +/- 0.150, F1: 0.261 +/- 0.138, MCC: 0.260 +/- 0.139, Acc: 0.979 +/- 0.006
nucleic
CovOneBind: 17 (1.000)
Bound: With predictions: 10, Without predictions: 0
Not Bound: With predictions: 7, Without predictions: 29
TP: 42, FP: 304, TN: 1488, FN: 35
Prec: 0.216 +/- 0.149, Recall: 0.368 +/- 0.200, F1: 0.194 +/- 0.108, MCC: 0.192 +/- 0.104, Acc: 0.752 +/- 0.144
small
CovOneBind: 34 (1.000)
Bound: With predictions: 25, Without predictions: 0
Not Bound: With predictions: 9, Without predictions: 12
TP: 164, FP: 286, TN: 4103, FN: 261
Prec: 0.314 +/- 0.096, Recall: 0.338 +/- 0.120, F1: 0.286 +/- 0.088, MCC: 0.254 +/- 0.086, Acc: 0.883 +/- 0.034
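
For a quick comparison of the two runs, the pooled confusion counts from the overall blocks can be turned into metrics directly. This is a minimal sketch with a hypothetical helper, `pooled_metrics`, using the standard definitions of precision, recall, F1, MCC and accuracy. The Prec/Recall/F1/MCC values printed by the script carry a +/- term and are presumably averaged per protein (an assumption), so the pooled numbers computed here are not expected to match them exactly.

```python
import math


def pooled_metrics(tp, fp, tn, fn):
    """Compute metrics from residue-level counts pooled over all proteins."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "accuracy": accuracy,
    }


# "overall" counts from the first run (wrong annotation file) ...
before = pooled_metrics(tp=93, fp=722, tn=5104, fn=187)
# ... and from the run with the independent-set annotation file.
after = pooled_metrics(tp=272, fp=596, tn=5056, fn=303)

for name in before:
    print(f"{name}: {before[name]:.3f} -> {after[name]:.3f}")
```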