molecule-one / megan

Code for "Molecule Edit Graph Attention Network: Modeling Chemical Reactions as Sequences of Graph Edits"

Are the data splits exactly the same as those in GLN and G2Gs?

iamxpy opened this issue

The results of the baselines in your paper are copied directly from the GLN and G2Gs papers. But I am wondering whether you reuse the same training/validation/test sets, or did you just randomly split the 50K reactions into 80%/10%/10%? Thanks.

After checking the file 'megan/data/uspto_50k/default_split.csv' and looking through the code in 'megan/src/datasets/uspto_50k.py', it seems that:
(1) The numbers of reactions in the training/validation/test sets are different from those in GLN and G2Gs.
(2) This line in uspto_50k.py will generate a different training/validation/test split on every run.

Personally, I think it is not fair to compare results on a different test set (especially one of a different size) to the results reported in the GLN paper. Am I missing something? Looking forward to your reply!

Thanks for your comments. You are correct that we generate the split ourselves and compare results to models trained and evaluated on different splits. Perhaps this fact is not stated explicitly enough in the paper.

However, when we tried to find a common benchmark version of USPTO-50k with a train/valid/test split, we discovered that the authors of the previous papers also generated splits on their own.
For instance, retrosim and GLN use a split generated by the authors of retrosim (in https://github.com/connorcoley/retrosim/blob/master/retrosim/data/get_data.py).
Transformer and seq2seq, on the other hand, use a split generated by the authors of seq2seq.
As far as we know, the authors of G2Gs did not share their code, but they state in their paper: "we randomly select 80% of the reactions as training set and divide the rest into validation and test sets with equal size", which indicates that they also generated the split themselves. The same goes for the authors of GraphRetro, who write: "Following prior work, we divide the dataset randomly in an 80:10:10 split for training, validation and testing".

You are right that not having the exact same data split can make the comparison of different methods unfair. We would love to be able to compare all the methods on exactly the same data but, unfortunately, we did not have the resources to do so, as it would practically require us to train and evaluate all compared methods on our data split.

Addressing your second concern: this line sets the random seed in numpy, which is used for generating the random split, so the generated split is the same on every run.
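
(For illustration only: a minimal sketch of why seeding numpy makes a probabilistic split reproducible. The seed value, dataset size, and variable names below are made up for the example and are not taken from the MEGAN codebase.)

```python
import numpy as np

# With a fixed seed, the probabilistic 80/10/10 assignment is identical on every run.
np.random.seed(123)  # illustrative seed, not the one used in the repo

n_reactions = 50_000  # illustrative dataset size
split = np.random.choice(['train', 'valid', 'test'], size=n_reactions, p=[0.8, 0.1, 0.1])
print({name: int((split == name).sum()) for name in ('train', 'valid', 'test')})
```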

Thanks for your reply. What worries me the most is the different size of the test set. Every prior work splits the USPTO-50k dataset into 40008:5001:5007, including G2Gs (I can confirm this because the authors sent me their uncleaned code after I emailed them). However, your implementation using np.random.choice(p=[0.8,0.1,0.1], ...) can only obtain a distribution close to 8:1:1, so the size of each split is different from all prior works. It is worth noting that your concurrent work RetroXpert (accepted at NeurIPS 2020) also uses the same data split as GLN and retrosim.
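
(An illustrative sketch of the difference being described, assuming the original split uses a per-reaction np.random.choice; the seed and variable names here are made up for the example.)

```python
import numpy as np

n = 40008 + 5001 + 5007  # split sizes quoted above (total 50016)
rng = np.random.default_rng(0)  # illustrative seed

# Per-reaction probabilistic assignment: the split sizes only approximate
# 80/10/10 and vary with the seed, so they cannot match 40008:5001:5007 exactly.
probabilistic = rng.choice(['train', 'valid', 'test'], size=n, p=[0.8, 0.1, 0.1])
print({name: int((probabilistic == name).sum()) for name in ('train', 'valid', 'test')})

# Exact-count split: shuffle indices once and slice to fixed sizes,
# which reproduces a 40008/5001/5007 partition exactly.
perm = rng.permutation(n)
train_idx = perm[:40008]
valid_idx = perm[40008:40008 + 5001]
test_idx = perm[40008 + 5001:]
print(len(train_idx), len(valid_idx), len(test_idx))  # 40008 5001 5007
```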

Hi @iamxpy ,
just for your information, we have just updated our paper on arXiv. In this update we changed the dataset split for USPTO-50k so that it is the same as in the GLN paper. The metrics for the new split are similar to before, but it should be a fairer comparison. Thanks for pointing out this issue.
Mikołaj

Can you share the results (all predictions) of the trained MEGAN model on the test set with this split? Thank you.