cvignac / DiGress

Hi,

I noticed that the number of molecules generated for evaluation on the MOSES dataset is 25000, as specified in the config file.

DiGress/configs/experiment/moses.yaml

Line 17 in 150ca14

final_model_samples_to_generate: 25000

The number of molecules is also 25000 in your shared SMILES samples: https://github.com/cvignac/DiGress/blob/main/generated_samples/generated_smiles_moses.txt.

However, the original MOSES paper suggests using 30000 generated samples for evaluation.
Source: https://arxiv.org/pdf/1811.12823.pdf#page=3

I'm new to this dataset and confused by the discrepancy. Could you explain why you chose 25000 instead of 30000?

Thanks,
Qi

If you check the MOSES code, I think it internally uses 20000 valid samples to compute the metrics. Since sampling 25k molecules already gives us enough valid ones, we did not sample more.
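
To make the arithmetic behind that concrete, here is a minimal sketch. It takes the 20000 figure from the reply above as an assumption (it is not verified against the MOSES code here), and the helper names `min_validity_rate` and `samples_to_generate` are hypothetical; they are not part of MOSES or DiGress.

```python
import math

# Assumption taken from the reply above: MOSES computes its metrics on
# 20000 valid molecules. This is not checked against the MOSES source.
N_VALID_NEEDED = 20_000


def min_validity_rate(n_valid_needed: int, n_sampled: int) -> float:
    """Smallest fraction of valid samples that still yields n_valid_needed molecules."""
    if n_sampled <= 0:
        raise ValueError("n_sampled must be positive")
    return n_valid_needed / n_sampled


def samples_to_generate(n_valid_needed: int, validity_rate: float) -> int:
    """Samples to draw so that the expected number of valid ones reaches n_valid_needed."""
    if not 0.0 < validity_rate <= 1.0:
        raise ValueError("validity_rate must be in (0, 1]")
    return math.ceil(n_valid_needed / validity_rate)


# With 25000 samples, the model must be valid at least this often:
print(min_validity_rate(N_VALID_NEEDED, 25_000))  # 0.8

# With 30000 samples (the number suggested in the MOSES paper):
print(min_validity_rate(N_VALID_NEEDED, 30_000))
```

In other words, under this assumption 25000 samples are enough as long as at least 80% of the generated molecules are valid; a model with lower validity would need more samples, which `samples_to_generate` estimates.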

Got it. Thanks!

Discrepancy for MOSES dataset evaluation protocol