HuggingFace Tokenizers compatibility
gabeorlanski opened this issue · comments
Hi, I have been trying to get SeqIO to work with HuggingFace's tokenizers for a while, but I keep running into trouble with non-T5 tokenizers. Specifically, because they are not SentencePiece tokenizers, tokenizers for models such as GPT-2 are incompatible with SeqIO's `SentencePieceVocabulary`, as they only ship the following vocab files:
```json
{
  "vocab_file": "vocab.json",
  "merges_file": "merges.txt",
  "tokenizer_file": "tokenizer.json"
}
```
Is there a currently supported way to use these tokenizers with SeqIO? Or would I need to make my own vocab class?
You can make your own subclass of seqio.Vocabulary that provides this compatibility. It would be an excellent contribution to the codebase!
+1 this would be great to have!
Hey, I'm about to implement this in the near future (and hopefully open a pull request).
It's specifically for the GPT-2 tokenizer, though the approach should generalize.
Are there any pitfalls I should look out for?