google / seqio

Task-based datasets, preprocessing, and evaluation for sequence models.

HuggingFace Tokenizers compatibility

gabeorlanski opened this issue

Hi, I have been trying to get SeqIO to work with HuggingFace tokenizers, but I keep running into trouble with non-T5 tokenizers. Specifically, because they are not SentencePiece tokenizers, tokenizers for models such as GPT-2 are incompatible with SeqIO's SentencePieceVocabulary; they only provide these vocab files:

{
  'vocab_file': 'vocab.json',
  'merges_file': 'merges.txt',
  'tokenizer_file': 'tokenizer.json'
}

Is there a currently supported way to use these tokenizers with SeqIO? Or would I need to make my own vocab class?

You can make your own subclass of seqio.Vocabulary that provides this compatibility. It would be an excellent contribution to the codebase!
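To make the idea concrete, here is a rough sketch of the shape such a wrapper could take. All names here (`HuggingFaceVocabulary`, `StubTokenizer`) are hypothetical; in real use the class would subclass `seqio.Vocabulary` and would also need the TF-graph methods (e.g. `_encode_tf`/`_decode_tf`), so check the actual abstract interface in the seqio source. The stub base and stub tokenizer below exist only so the sketch runs without seqio or transformers installed.

```python
class StubTokenizer:
    """Stand-in for a real HuggingFace tokenizer (e.g. GPT2TokenizerFast).

    Only for illustration: a real tokenizer does subword tokenization,
    not whitespace splitting.
    """
    eos_token_id = 2
    unk_token_id = 3

    def __init__(self):
        self._vocab = {"hello": 0, "world": 1}
        self._inverse = {i: tok for tok, i in self._vocab.items()}

    def encode(self, text):
        # Map each whitespace token to an id, falling back to <unk>.
        return [self._vocab.get(tok, self.unk_token_id) for tok in text.split()]

    def decode(self, ids):
        return " ".join(self._inverse.get(i, "<unk>") for i in ids)

    def __len__(self):
        return 4


class HuggingFaceVocabulary:
    # In real code this would be: class HuggingFaceVocabulary(seqio.Vocabulary)
    def __init__(self, tokenizer):
        self._tokenizer = tokenizer

    @property
    def eos_id(self):
        return self._tokenizer.eos_token_id

    @property
    def unk_id(self):
        return self._tokenizer.unk_token_id

    @property
    def _base_vocab_size(self):
        # HuggingFace tokenizers report their vocab size via len().
        return len(self._tokenizer)

    def _encode(self, s):
        return self._tokenizer.encode(s)

    def _decode(self, ids):
        return self._tokenizer.decode(list(ids))


vocab = HuggingFaceVocabulary(StubTokenizer())
print(vocab._encode("hello world"))  # -> [0, 1]
print(vocab._decode([0, 1]))         # -> hello world
```

One design point worth flagging: SeqIO assumes a fixed pad id (0), while GPT-2 has no pad token at all, so a real implementation has to decide how to reconcile the two.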

+1 this would be great to have!

Hey, I'm about to implement this in the near future (and hopefully open a pull request).
Specifically for the GPT-2 tokenizer, but the approach should generalize.
Are there any pitfalls I should look out for?