seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied on chemical SMILES data for drug design, chemical modelling, etc.


Details about curating pubchem dataset

taew0361 opened this issue

Thank you for publishing this great work! I have a question about the PubChem dataset used as the pretraining set.

In the arXiv paper, it is briefly mentioned that the 77M PubChem dataset is curated down to the 10M PubChem dataset.

Could you explain in a bit more detail how to curate the 77M PubChem dataset?

e.g. SMILES containing non-bonded components are removed
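
To make the kind of rule in that example concrete: in SMILES, a `.` separates parts of a structure that are not bonded to each other (salts, mixtures, and so on), so a filter like this could be sketched as below. This only illustrates the question. It is not the authors' curation procedure, and the function names and sample strings are hypothetical.

```python
from typing import Iterable, List


def has_nonbonded_parts(smiles: str) -> bool:
    """Return True if the SMILES string contains '.', the SMILES
    separator between components that are not bonded together."""
    return "." in smiles


def drop_nonbonded(smiles_list: Iterable[str]) -> List[str]:
    """Keep only SMILES strings that describe a single connected component."""
    return [s for s in smiles_list if not has_nonbonded_parts(s)]


# Hypothetical sample input: two single molecules, one salt, one mixture.
examples = ["CCO", "[Na+].[Cl-]", "c1ccccc1", "CC(=O)O.N"]
filtered = drop_nonbonded(examples)
print(filtered)
```

Whether the 10M set was actually built with a rule like this, or by some other selection such as random subsampling, is exactly what I would like to confirm.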