uta-smile / RetroXpert

How to get the accuracy of two phases & any tips to reproduce the results close to those in paper?

iamxpy opened this issue

How to score the whole method? I ran bash score.sh and it gave me the following result:

Top-1: 68.1% || Invalid SMILES 3.00%
Top-2: 78.0% || Invalid SMILES 23.09%
Top-3: 80.9% || Invalid SMILES 34.75%
Top-4: 82.0% || Invalid SMILES 41.08%
Top-5: 82.7% || Invalid SMILES 44.96%
Top-6: 83.1% || Invalid SMILES 46.99%
Top-7: 83.5% || Invalid SMILES 48.21%
Top-8: 83.7% || Invalid SMILES 50.85%
Top-9: 84.1% || Invalid SMILES 51.89%
Top-10: 84.2% || Invalid SMILES 50.71%
Top-11: 84.4% || Invalid SMILES 52.85%
Top-12: 84.6% || Invalid SMILES 53.17%
Top-13: 84.8% || Invalid SMILES 53.27%
Top-14: 85.0% || Invalid SMILES 53.96%
Top-15: 85.1% || Invalid SMILES 55.76%
Top-16: 85.2% || Invalid SMILES 54.78%
Top-17: 85.3% || Invalid SMILES 55.92%
Top-18: 85.4% || Invalid SMILES 56.16%
Top-19: 85.5% || Invalid SMILES 56.44%
Top-20: 85.6% || Invalid SMILES 57.40%
Top-21: 85.6% || Invalid SMILES 57.60%
Top-22: 85.6% || Invalid SMILES 57.18%
Top-23: 85.7% || Invalid SMILES 56.24%
Top-24: 85.8% || Invalid SMILES 58.10%
Top-25: 85.8% || Invalid SMILES 58.46%
Top-26: 85.9% || Invalid SMILES 58.50%
Top-27: 86.0% || Invalid SMILES 57.46%
Top-28: 86.1% || Invalid SMILES 59.18%
Top-29: 86.1% || Invalid SMILES 58.98%
Top-30: 86.2% || Invalid SMILES 58.50%
Top-31: 86.2% || Invalid SMILES 58.42%
Top-32: 86.3% || Invalid SMILES 57.32%
Top-33: 86.3% || Invalid SMILES 58.52%
Top-34: 86.4% || Invalid SMILES 57.64%
Top-35: 86.4% || Invalid SMILES 60.16%
Top-36: 86.4% || Invalid SMILES 59.50%
Top-37: 86.5% || Invalid SMILES 58.92%
Top-38: 86.5% || Invalid SMILES 58.18%
Top-39: 86.6% || Invalid SMILES 58.16%
Top-40: 86.6% || Invalid SMILES 59.12%
Top-41: 86.7% || Invalid SMILES 60.77%
Top-42: 86.7% || Invalid SMILES 59.76%
Top-43: 86.7% || Invalid SMILES 59.56%
Top-44: 86.7% || Invalid SMILES 58.84%
Top-45: 86.8% || Invalid SMILES 60.54%
Top-46: 86.8% || Invalid SMILES 60.79%
Top-47: 86.8% || Invalid SMILES 60.44%
Top-48: 86.8% || Invalid SMILES 62.55%
Top-49: 86.8% || Invalid SMILES 62.53%
Top-50: 86.9% || Invalid SMILES 64.81%
**--Second phase Top1 acc =  0.6814459756341122**

It only reported the result of the second phase. I inspected the code in score_prediction.py and guess that we should pass the file reaction_center_preds, namely the prediction result of the first step, to the script, but I am not sure which file to use. Also, the reported top-1 accuracy of the second phase (0.681) is lower than the score reported in the paper (0.734). Is this expected? If not, any idea what may cause the lower accuracy?
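For readers trying to interpret a report of this shape, here is a minimal sketch of how such a table could be computed. It is not the code from score_prediction.py. It assumes that the Top-k column is cumulative (a sample counts as solved once any of its first k predictions matches the target) and that the Invalid SMILES column is the share of unparsable predictions at rank k alone. The names `topk_report` and `toy_canonicalize` are hypothetical, and `toy_canonicalize` is only a stand-in for a real canonicalizer such as RDKit's `Chem.MolFromSmiles` followed by `Chem.MolToSmiles`.

```python
from typing import Callable, List, Optional, Tuple


def topk_report(
    predictions: List[List[str]],
    targets: List[str],
    canonicalize: Callable[[str], Optional[str]],
) -> List[Tuple[int, float, float]]:
    """Return rows of (k, cumulative top-k accuracy %, invalid % at rank k).

    predictions[i] holds the ranked beam outputs for sample i; every sample
    is assumed to have the same number of predictions.
    """
    n = len(targets)
    beam = len(predictions[0])
    canon_targets = [canonicalize(t) for t in targets]
    solved = [False] * n  # becomes True once any rank <= k matches the target
    rows = []
    for rank in range(beam):
        invalid = 0
        for i in range(n):
            canon = canonicalize(predictions[i][rank])
            if canon is None:
                invalid += 1  # unparsable prediction at this rank
            elif canon == canon_targets[i]:
                solved[i] = True
        rows.append((rank + 1, 100.0 * sum(solved) / n, 100.0 * invalid / n))
    return rows


def toy_canonicalize(smiles: str) -> Optional[str]:
    """Hypothetical stand-in for a real SMILES canonicalizer.

    Rejects empty strings and unbalanced parentheses, and sorts the
    dot-separated fragments so that reactant order does not matter.
    """
    if not smiles or smiles.count("(") != smiles.count(")"):
        return None
    return ".".join(sorted(smiles.split(".")))


if __name__ == "__main__":
    # Made-up toy data: two samples, beam size 2.
    preds = [["CCO.CC", "CC(O"], ["C(C", "CN"]]
    golds = ["CC.CCO", "CN"]
    for k, acc, invalid in topk_report(preds, golds, toy_canonicalize):
        print(f"Top-{k}: {acc:.1f}% || Invalid SMILES {invalid:.2f}%")
```

Under these assumptions, a rising invalid rate at higher ranks does not lower the top-k accuracy, because a sample that was already solved at an earlier rank stays solved.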

commented

bash score.sh gives you the overall accuracy. Because the input is the prediction from the first phase, the second-phase accuracy reported here is actually the overall accuracy.

If you read the README carefully, you will find that we added some optimizations to improve the first phase; however, they did not improve the overall accuracy. My guess is that the second phase now gets less error data augmentation than before. If you want results close to those in the paper, you can remove these optimizations manually, or choose a checkpoint that gives worse results for the first phase.
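To make the distinction between the overall accuracy and a per-phase accuracy concrete, here is a small sketch. It is not part of the repository. It assumes you have per-sample correctness flags for each phase, which score.sh does not output as far as this thread shows; the function name `phase_accuracies` and the flag lists are hypothetical, and the numbers are made up.

```python
from typing import List, Tuple


def phase_accuracies(
    phase1_correct: List[bool],
    final_correct: List[bool],
) -> Tuple[float, float, float]:
    """Return (phase-1 accuracy, overall accuracy, phase-2 accuracy given a correct phase 1).

    phase1_correct[i]: whether the first phase was right for sample i.
    final_correct[i]:  whether the final output matched the ground truth when the
                       second phase was fed the first-phase prediction.
    """
    n = len(phase1_correct)
    phase1_acc = sum(phase1_correct) / n
    # Scoring outputs that were generated from first-phase predictions
    # measures the whole pipeline, so this is the overall accuracy.
    overall_acc = sum(final_correct) / n
    # Second-phase accuracy restricted to samples the first phase got right.
    hits = [f for p, f in zip(phase1_correct, final_correct) if p]
    conditional_acc = sum(hits) / len(hits) if hits else 0.0
    return phase1_acc, overall_acc, conditional_acc


if __name__ == "__main__":
    # Made-up flags for four samples.
    p1 = [True, True, True, False]
    final = [True, True, False, False]
    phase1_acc, overall_acc, conditional_acc = phase_accuracies(p1, final)
    print(f"phase 1: {phase1_acc:.3f}, overall: {overall_acc:.3f}, "
          f"phase 2 given correct phase 1: {conditional_acc:.3f}")
```

In this toy case the overall accuracy (0.5) is lower than both the first-phase accuracy (0.75) and the conditional second-phase accuracy (about 0.667), which is why a number computed on first-phase predictions should be compared with the overall figure rather than a per-phase one.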

That clarifies things, and I will try what you suggested. Thanks!