KWARC / llamapun

common language and mathematics processing algorithms, in Rust

Home Page: https://kwarc.info/systems/llamapun/


Consider interop with huggingface's tokenizers

dginev opened this issue

Huggingface maintains a Rust tokenization library compatible with their language model pipelines. It would be worth investigating how to interoperate with that experimental flow, and also seeing whether I can leverage their approach to provide a Python wrapper for the llamapun abstractions as well. A sketch of the wrapper idea follows below.
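On the Python wrapper angle: huggingface's tokenizers exposes its Rust core to Python via PyO3, and the same pattern could apply here. A minimal sketch, assuming a hypothetical `extract_plain_text` entry point and module name (neither exists in llamapun today; exact PyO3 macros also vary by version):

```rust
use pyo3::prelude::*;

// Hypothetical: expose a llamapun-style plain-text extraction step to
// Python, mirroring how huggingface's tokenizers wraps its Rust core.
#[pyfunction]
fn extract_plain_text(html_path: &str) -> PyResult<String> {
    // A real implementation would call into llamapun's document APIs here.
    Ok(format!("plain text extracted from {}", html_path))
}

#[pymodule]
fn llamapun_py(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(extract_plain_text, m)?)?;
    Ok(())
}
```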

The right place for huggingface's tokenizers would be after we do our own math-aware preprocessing; it wouldn't really play any part in serializing a "token model" plain text file, which is the current endpoint of using llamapun. Once that plain text is read in by a specific modeling framework, it needs to be retokenized per the model's requirements (e.g. around 2 million distinct tokens if one uses GloVe/word2vec over arXiv, but only about 30 thousand tokens if one uses subword tokenization). So huggingface's tokenization is probably a step to apply after preprocessing via llamapun is done.
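As a concrete illustration of that ordering, here is a minimal sketch that reads a llamapun-produced plain text file and retokenizes it with the tokenizers crate. Both file names are placeholders, and the pretrained `tokenizer.json` (e.g. a BERT-style subword vocabulary) is an assumption:

```rust
use std::fs;
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Load a pretrained subword tokenizer (~30k vocabulary entries).
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // llamapun's endpoint: a preprocessed, math-aware plain text file,
    // one sentence per line.
    let corpus = fs::read_to_string("token_model.txt")?;
    for line in corpus.lines() {
        // Retokenize per the model's requirements, after preprocessing.
        let encoding = tokenizer.encode(line, false)?;
        println!("{:?}", encoding.get_tokens());
    }
    Ok(())
}
```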

And as things stand, the huggingface/keras ecosystem (and some of its competitors) is so convenient that llamapun should really act as a math-aware preprocessing library and leave the actual modeling to something else.