Details about curating the PubChem dataset
taew0361 opened this issue · comments
Thank you for publishing this great work! I have a question about the PubChem dataset used as the pretraining set.
The arXiv paper briefly mentions that the 77M PubChem dataset was curated down to 10M entries.
Could you explain in more detail how the 77M PubChem dataset was curated?
For example, were SMILES strings containing non-bonded components removed?