jerryji1993 / DNABERT

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Home Page: https://doi.org/10.1093/bioinformatics/btab083

Can you use the pre-trained BERT models, but add novel tokens to the vocabulary?

mepster opened this issue

Can you use the pre-trained BERT models, but add novel tokens to the vocabulary during fine-tuning? Any tips on what's needed for this?

Or during fine-tuning MUST you use the same vocab.txt file that was used in pre-training?

I want to add some of the IUPAC ambiguity symbols, for example the symbol Y, which means "T or C". So that will expand my vocabulary a lot.

But I don't have the resources to retrain.

Related, but I believe it is about training from scratch:
#81
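
For reference, with a standard Hugging Face BERT checkpoint the usual pattern is to add the new tokens to the tokenizer and then resize the model's embedding matrix so the new (randomly initialized) rows can be learned during fine-tuning. The sketch below shows that general technique only; the checkpoint path and the example k-mers are hypothetical, and DNABERT itself loads its vocabulary through a custom DNATokenizer in a pinned fork of transformers, so the exact class names in this repo may differ.

```python
# Minimal sketch of the standard Hugging Face "add tokens, resize embeddings"
# pattern before fine-tuning. Paths and new k-mers are hypothetical.
from transformers import BertForSequenceClassification, BertTokenizer

model_path = "path/to/dnabert-6-checkpoint"  # hypothetical local checkpoint

tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)

# Hypothetical new 6-mers containing the IUPAC code Y ("T or C").
# Each ambiguous position multiplies the number of possible k-mers,
# so in practice you would add only the k-mers your data actually contains.
new_kmers = ["AAAAAY", "AAAAYA", "TACGYT"]

num_added = tokenizer.add_tokens(new_kmers)
print(f"Added {num_added} new tokens to the vocabulary")

# Grow the embedding matrix to match the enlarged vocabulary.
# The new rows are randomly initialized and only become meaningful
# once they are trained during fine-tuning.
model.resize_token_embeddings(len(tokenizer))

# Save the updated tokenizer and model so the fine-tuning scripts
# pick up the expanded vocabulary and resized embeddings.
tokenizer.save_pretrained("dnabert-6-expanded")  # hypothetical output dir
model.save_pretrained("dnabert-6-expanded")
```

The caveat is that the added token embeddings carry no pre-trained signal, so they only become useful if the fine-tuning data contains enough examples of them. An alternative sometimes used instead of expanding the vocabulary is to keep the original vocab.txt and either map ambiguous bases to an existing token such as [UNK] or expand each ambiguous base into its concrete alternatives before k-mer tokenization.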