Details about curating the PubChem dataset

Question

taew0361 opened this issue 2 years ago · comments

Thank you for publishing this great work! I have a question about the PubChem dataset used as the pretraining set.

The arXiv paper briefly mentions that the 77M PubChem dataset was curated down to the 10M PubChem dataset.

Could you explain in a bit more detail how the 77M PubChem dataset was curated?

e.g., SMILES with non-bonded components are removed
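
To make the example above concrete, here is a minimal sketch of the kind of filter I have in mind. It is only my guess, not something taken from the paper or the repository: the function names `is_single_fragment` and `filter_single_fragment` are hypothetical, and it assumes that any `.` in a SMILES string marks a non-bonded (disconnected) component, which ignores the rare case where ring-closure digits bond atoms across a dot.

```python
def is_single_fragment(smiles: str) -> bool:
    """Return True if the SMILES string contains no '.' separator.

    Assumption: '.' always marks a disconnected component (e.g. a salt
    or mixture). This is a string-level check, not a chemistry parser.
    """
    return "." not in smiles


def filter_single_fragment(smiles_list):
    """Keep only the SMILES that pass the single-fragment check (hypothetical filter)."""
    return [s for s in smiles_list if is_single_fragment(s)]


# Small illustrative input: ethanol, sodium chloride (two ions), benzene.
examples = ["CCO", "[Na+].[Cl-]", "c1ccccc1"]
kept = filter_single_fragment(examples)
print(kept)
```

Is the actual curation something along these lines, or were other criteria used as well?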