google / seqio

Task-based datasets, preprocessing, and evaluation for sequence models.

HuggingFace Tokenizers compatibility

gabeorlanski opened this issue

Hi, I have been trying to get SeqIO to work with HuggingFace tokenizers, but I keep running into trouble with non-T5 tokenizers. Specifically, because they are not SentencePiece tokenizers, tokenizers for models such as GPT-2 are incompatible with SeqIO's SentencePieceVocabulary; they only provide these vocab files:

{
  'vocab_file': 'vocab.json',
  'merges_file': 'merges.txt',
  'tokenizer_file': 'tokenizer.json'
}

Is there a currently supported way to use these tokenizers with SeqIO? Or would I need to make my own vocab class?

You can make your own subclass of seqio.Vocabulary that provides this compatibility. It would be an excellent contribution to the codebase!
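To make the idea concrete, here is a rough sketch of the shape such a wrapper could take. All names here (`HuggingFaceVocabulary`, `StubTokenizer`) are hypothetical; in real use the class would subclass `seqio.Vocabulary` and would also need the TF-graph methods (e.g. `_encode_tf`/`_decode_tf`), so check the actual abstract interface in the seqio source. The stub base and stub tokenizer below exist only so the sketch runs without seqio or transformers installed.

```python
class StubTokenizer:
    """Stand-in for a real HuggingFace tokenizer (e.g. GPT2TokenizerFast).

    Only for illustration: a real tokenizer does subword tokenization,
    not whitespace splitting.
    """
    eos_token_id = 2
    unk_token_id = 3

    def __init__(self):
        self._vocab = {"hello": 0, "world": 1}
        self._inverse = {i: tok for tok, i in self._vocab.items()}

    def encode(self, text):
        # Map each whitespace token to an id, falling back to <unk>.
        return [self._vocab.get(tok, self.unk_token_id) for tok in text.split()]

    def decode(self, ids):
        return " ".join(self._inverse.get(i, "<unk>") for i in ids)

    def __len__(self):
        return 4


class HuggingFaceVocabulary:
    # In real code this would be: class HuggingFaceVocabulary(seqio.Vocabulary)
    def __init__(self, tokenizer):
        self._tokenizer = tokenizer

    @property
    def eos_id(self):
        return self._tokenizer.eos_token_id

    @property
    def unk_id(self):
        return self._tokenizer.unk_token_id

    @property
    def _base_vocab_size(self):
        # HuggingFace tokenizers report their vocab size via len().
        return len(self._tokenizer)

    def _encode(self, s):
        return self._tokenizer.encode(s)

    def _decode(self, ids):
        return self._tokenizer.decode(list(ids))


vocab = HuggingFaceVocabulary(StubTokenizer())
print(vocab._encode("hello world"))  # -> [0, 1]
print(vocab._decode([0, 1]))         # -> hello world
```

One design point worth flagging: SeqIO assumes a fixed pad id (0), while GPT-2 has no pad token at all, so a real implementation has to decide how to reconcile the two.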

+1 this would be great to have!

Hey, I'm about to implement this in the near future (and hopefully open a pull request).
Specifically for the GPT-2 tokenizer, but the approach should generalize.
Are there any pitfalls I should look out for?