seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied to chemical SMILES data for drug design, chemical modelling, etc.

Accuracy metric in the pre-training stage

yspaik opened this issue

This is a question about SMILES pre-training. Do you have any metrics for determining whether pre-training is going well? The short paper linked in the README only shows results after fine-tuning on Tox21. Do you have an accuracy measure of how many masked tokens are predicted correctly during the unsupervised (pre-training) stage? If so, how does that accuracy vary with the representation type (SMILES-BPE, SMILES, SELFIES-BPE, etc.)?

Great question! We're currently wrapping up a preprint which will be made public soon (mid-to-late October). We share results comparing different tokenization strategies and molecular string representations, as well as the effects of pre-training on progressively larger molecular datasets. Stay tuned; we will also open-source the code then!

With regard to pre-training, we evaluate primarily using the train and eval loss on the MLM predictions to make sure training is going well (and to check for overfitting, etc.). The HuggingFace transformers library provides good utilities for configuring these runs to log automatically to Weights & Biases, so I would check out their documentation. A rough sketch of such a setup is shown below.
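For reference, here is a minimal sketch of what such a monitored MLM pre-training run could look like with the transformers Trainer. It is not the repository's exact script: the RoBERTa-style masked LM, the `./smiles_tokenizer` path, the `train.txt` / `valid.txt` files, and all hyperparameters are placeholder assumptions, and the `compute_metrics` helper is only one way to track the masked-token accuracy asked about above alongside the eval loss.

```python
# Minimal sketch, not the repository's exact pre-training script.
# Assumptions: a RoBERTa-style masked LM, a tokenizer saved at ./smiles_tokenizer
# (hypothetical path), and plain-text SMILES files train.txt / valid.txt
# with one molecule per line.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("./smiles_tokenizer")
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

# Load raw SMILES strings and tokenize them.
raw = load_dataset("text", data_files={"train": "train.txt", "eval": "valid.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Mask 15% of tokens; the eval loss on these predictions is the main signal
# for whether pre-training is progressing (and for spotting overfitting).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)


def preprocess_logits_for_metrics(logits, labels):
    # Keep only the argmax token ids so the full vocabulary logits
    # are not accumulated in memory during evaluation.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)


def compute_metrics(eval_pred):
    # Accuracy over masked positions only (labels are -100 elsewhere).
    preds, labels = eval_pred
    mask = labels != -100
    return {"masked_token_accuracy": float((preds[mask] == labels[mask]).mean())}


args = TrainingArguments(
    output_dir="./smiles-mlm-pretrain",
    evaluation_strategy="steps",   # report eval loss periodically, not just at the end
    eval_steps=500,
    logging_steps=100,
    per_device_train_batch_size=32,
    num_train_epochs=1,
    report_to="wandb",             # stream train/eval curves to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
trainer.train()
```

Running the same sketch with different tokenizers and string representations (e.g. SMILES vs. SELFIES) and comparing the logged eval loss and masked-token accuracy curves is one way to make the per-representation comparison the question asks about.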

@yspaik We're releasing the first draft of the paper on arXiv this weekend and will link it here! Many more updates to come, including SMILES tokenizer pre-training notebooks, SELFIES pre-training notebooks, and data loaders.

@seyonechithrananda Great news! I'll read it carefully.

Closing the issue as the arXiv paper is released.