seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied to chemical SMILES data for drug design, chemical modelling, etc.

Accuracy metric in the pre-training stage

yspaik opened this issue

This is a question about SMILES pre-training. Do you have any metrics for determining whether pre-training is going well? The short paper linked in the README only shows results after fine-tuning on Tox21. Do you have an accuracy measure of how many masked tokens are predicted correctly during the unsupervised (pre-training) stage? If so, how does that accuracy vary with the representation type (SMILES-BPE, SMILES, SELFIES-BPE, etc.)?

Great question! We're currently wrapping up a preprint which will be made public soon (mid-to-late October). We share results comparing different tokenization strategies and molecular string representations, as well as the effects of pre-training on progressively larger molecular datasets. Stay tuned; we will also open-source the code then!

With regard to pre-training, we evaluate primarily using the train and eval loss on the MLM predictions to make sure training is going well (and to check for overfitting, etc.). The HuggingFace transformers library provides good utilities for configuring these runs to log automatically to Weights & Biases, so I would check out their documentation. A rough sketch of such a setup is shown below.
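For reference, here is a minimal sketch of what such a monitored MLM pre-training run could look like with the transformers Trainer. It is not the repository's exact script: the RoBERTa-style masked LM, the `./smiles_tokenizer` path, the `train.txt` / `valid.txt` files, and all hyperparameters are placeholder assumptions, and the `compute_metrics` helper is only one way to track the masked-token accuracy asked about above alongside the eval loss.

```python
# Minimal sketch, not the repository's exact pre-training script.
# Assumptions: a RoBERTa-style masked LM, a tokenizer saved at ./smiles_tokenizer
# (hypothetical path), and plain-text SMILES files train.txt / valid.txt
# with one molecule per line.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("./smiles_tokenizer")
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

# Load raw SMILES strings and tokenize them.
raw = load_dataset("text", data_files={"train": "train.txt", "eval": "valid.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Mask 15% of tokens; the eval loss on these predictions is the main signal
# for whether pre-training is progressing (and for spotting overfitting).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)


def preprocess_logits_for_metrics(logits, labels):
    # Keep only the argmax token ids so the full vocabulary logits
    # are not accumulated in memory during evaluation.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)


def compute_metrics(eval_pred):
    # Accuracy over masked positions only (labels are -100 elsewhere).
    preds, labels = eval_pred
    mask = labels != -100
    return {"masked_token_accuracy": float((preds[mask] == labels[mask]).mean())}


args = TrainingArguments(
    output_dir="./smiles-mlm-pretrain",
    evaluation_strategy="steps",   # report eval loss periodically, not just at the end
    eval_steps=500,
    logging_steps=100,
    per_device_train_batch_size=32,
    num_train_epochs=1,
    report_to="wandb",             # stream train/eval curves to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
trainer.train()
```

Running the same sketch with different tokenizers and string representations (e.g. SMILES vs. SELFIES) and comparing the logged eval loss and masked-token accuracy curves is one way to make the per-representation comparison the question asks about.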

@yspaik We're releasing the first draft of the paper on arXiv this weekend and will link it here! Many more updates to come, including SMILES tokenizer pre-training notebooks, SELFIES pre-training notebooks, and data loaders.

@seyonechithrananda Great news! I'll read it carefully.

Closing the issue as the arXiv paper is released.