rdk / p2rank

P2Rank: Protein-ligand binding site prediction tool based on machine learning. Stand-alone command line program / Java library for predicting ligand binding pockets from protein structure.

Home Page: https://rdk.github.io/p2rank/

How does predict-eval work

skodapetr opened this issue

Running on 2 datasets, namely:

  • coach420.ds
  • coach420(mlig).ds
    leads to different results, with DCA_4_2 being 77.1 and 75.7 respectively.

How can a run on coach420.ds determine the ligand binding sites for evaluation, as they are not provided in the .ds file?

If it can, why is there a difference in the prediction performance?

Besides, in `runs_pred.csv` it is not clear why the DCA value is used, when the reported results use what seems to be DCC.

Also, the results using eval_predict_coach420-mlig are 71.2 and 75.7, compared to 72.0 and 78.3 reported in the article. Is it possible to reproduce the article results, and if so, how?

Is there a traineval/optimization command that was used to train the models behind the reported results?

commented

Both datasets [coach420.ds, coach420(mlig).ds] contain a list of proteins with ligands and can be used for evaluation or training. If a dataset does not contain explicitly specified relevant ligands (as is the case in coach420.ds), ligands are loaded from the pdb files and relevant ligands are determined by a filter (for instance, ligands that are too small are removed). This is described in detail in the Supplementary documents to https://doi.org/10.1186/s13321-018-0285-8 and https://doi.org/10.1093/nar/gkz424 . The difference in prediction performance is due to the fact that the sets of relevant ligands are not the same in those two datasets.
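
For illustration, the two dataset styles look roughly like this (the paths, ligand codes and column names below are made up for the sketch; the exact syntax, including the header line and comment support, is described in p2rank-datasets/README.md):

```
# plain dataset (coach420.ds style): one structure file per line,
# relevant ligands are selected automatically by P2Rank's ligand filter
coach420/1a4j.pdb
coach420/3b7e.pdb

# explicit-ligand dataset (coach420(mlig).ds style): each structure is
# paired with the codes of the ligands considered relevant
HEADER: protein  ligands
coach420/1a4j.pdb  RTL
coach420/3b7e.pdb  PLP
```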

This is partially described in p2rank-datasets/README.md . Improvements to this readme file are welcome.

Is it possible to reproduce the article results and if so, how?

This depends on which experiments you run. The results in the article are the results of one particular trained RF model that was selected as the default model. If you use prank eval-predict with the default model, you should get the same results. If you run training with prank traineval and -loop 10, it will train 10 new models and report the average of those 10 runs, so the results may be slightly different. But even the results of prank eval-predict may differ slightly from the original release due to bugfixes that were implemented since then.

The only way to exactly reproduce the results is to use the old version of the codebase. Even then, training of RF models may not be completely deterministic, even when you use the same random seed. This depends on which RF implementation you use: FastRandomForest is not deterministic due to its particular implementation of parallelism (i.e. it is deterministic only if you run it in 1 thread).
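
For concreteness, the two modes above correspond roughly to the following commands (flag names as recalled from the P2Rank README; please verify against prank help for your version):

```
# evaluate the bundled default model on a dataset
prank eval-predict coach420.ds

# train 10 new models on a training dataset, evaluate each on coach420.ds
# and report the averaged results
prank traineval -t <training_dataset>.ds -e coach420.ds -loop 10

# single-threaded run, assuming the -threads parameter also limits
# FastRandomForest parallelism and thus makes training deterministic
prank traineval -t <training_dataset>.ds -e coach420.ds -loop 10 -threads 1
```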

DCA is the correct measure in which the results are reported. There is an error in the JChem article that resulted from a bad compilation of the tex file, which I discovered only very recently.

In any case, check out the Supplementary documents to the mentioned articles; they contain more detailed and correct results.

Thanks for the response. The question regarding reproducible results was not about training a model, but about evaluating the model, as I would expect the following command to yield the same results as in the article:

.\prank.bat eval-predict .\..\coach420(mlig).ds

but there is some difference: 71.2 and 75.7 compared to 72.0 and 78.3.
Or is this just due to bug fixing etc., and can these results be considered reasonably similar?

commented

It is certainly possible that such a difference could be caused by bug fixing; however, in this case I think you might be comparing two different things. coach420.ds and coach420(mlig).ds are effectively two different datasets. The main text of the article reports results only on coach420.ds. Results on coach420(mlig).ds are included only in the Supplementary (Additional File 1) / Table 1 and are: 71.2 and 76.5.
This is significantly closer, but there is still a difference in DCA_4_2. It is hard to say where it is coming from. It could be bug fixing or a new bug, possibly some parsing error in a new version of BioJava. Is there any ERROR in the log? You can also try to run it with -fail_fast 1 and check if/where it fails.
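
For example (assuming a Unix shell; the output directory name eval_predict_coach420-mlig is taken from the earlier comment, and the log file name may differ between versions):

```
prank eval-predict "coach420(mlig).ds" -fail_fast 1

# then scan the run log in the output directory for parsing problems
grep ERROR eval_predict_coach420-mlig/*.log
```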

71.2, 75.7 and 71.2, 76.5 seem reasonably close to me - this should be reproducible, but as you mentioned there might have been some changes in the library, etc.