seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied on chemical SMILES data for drug design, chemical modelling, etc.

how to use SELFIES with ChemBERTa

yspaik opened this issue

Hi, I am Keanu, and I am very interested in AI-driven drug discovery. I found you on the Hugging Face model hub and was very impressed by what you have done. I'd like to use your pretrained network with the SELFIES parser. I looked through the configuration and vocab, but I am a little confused.

This is a question about how to use the SELFIES parser. With the Hugging Face WordPiece tokenizer, text is split first by the basic tokenizer and then divided into word pieces by the BPE. With the SMILES tokenizer that you implemented in DeepChem, SMILES is first parsed into atom-level units by the basic tokenizer, and it seems that it is not split further into word pieces by BPE. On the other hand, from a rough look at seyonec/BPE_SELFIES_PubChem_shard00_120k, my guess is that the SELFIES parser parses the string first and then creates second-level subtokens with BPE. Is that right? If so, is there a reason for doing that?
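To make the distinction in the question concrete, here is a toy sketch in plain Python of the two-stage scheme being guessed at: first split a SELFIES string into its bracketed symbols, then apply BPE-style merges on top of those symbols. This is only an illustration under stated assumptions. The function names (`split_selfies`, `learn_merges`, `apply_merge`, `encode`) are hypothetical, the merge rule is a simplified BPE, and nothing here is confirmed to match how seyonec/BPE_SELFIES_PubChem_shard00_120k or the DeepChem SMILES tokenizer actually work.

```python
import re
from collections import Counter
from typing import List, Tuple

# Assumption: SELFIES symbols are bracketed units such as [C], [=O], [Ring1].
SELFIES_SYMBOL = re.compile(r"\[[^\]]*\]")


def split_selfies(selfies: str) -> List[str]:
    """Stage 1: split a SELFIES string into bracketed symbols."""
    return SELFIES_SYMBOL.findall(selfies)


def apply_merge(seq: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace each adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(corpus: List[List[str]], num_merges: int) -> List[Tuple[str, str]]:
    """Stage 2 (training): greedily learn the most frequent adjacent pairs."""
    sequences = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Break ties on the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(selfies: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Symbol-level split followed by the learned merges, in order."""
    tokens = split_selfies(selfies)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Tiny hypothetical corpus: two short SELFIES strings.
corpus = [split_selfies("[C][C][O]"), split_selfies("[C][C][N]")]
merges = learn_merges(corpus, num_merges=1)
print(split_selfies("[C][C][O]"))   # symbol-level tokens only
print(encode("[C][C][O]", merges))  # symbol-level tokens plus one merge
```

With an empty merge list, `encode` behaves like a purely symbol-level tokenizer, which corresponds to the atom-level behaviour described for the SMILES tokenizer. Adding merges shortens sequences at the cost of a larger vocabulary, which is one plausible reason for combining the two stages.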

When will we get an example of using ChemBERTa with SELFIES in tutorial format? I tried to use your pretrained SELFIES model mentioned above by referring to DeepChem's SMILES tokenizer, but it keeps failing :-)

Hi Keanu, thanks for reaching out and being interested in our work!

We're providing clearer details and documentation on how to interface with the numerous tokenizers and string representations we employ in an arXiv paper coming out this week. The full library will come out then, and I'll be happy to share more about using SELFIES with BPE. I will ping this issue when that is out. Feel free to remind me if I don't reply by next week :)

Ah yes, thank you for your quick reply.
That is good news. I will read it with pleasure. :-)

@seyonechithrananda Sorry for the rush. I know your schedule is very busy, but I'd appreciate it if you could share the arXiv paper link when you release it.

The arXiv paper is out now, and a Twitter post and much more code will finally be open-sourced soon! I definitely didn't anticipate all the attention this has been getting, so I'm excited to finally share more! Arxiv