KWARC / llamapun

common language and mathematics processing algorithms, in Rust

Home Page: https://kwarc.info/systems/llamapun/


Consider interop with huggingface's tokenizers

dginev opened this issue

Huggingface maintains a Rust tokenization library compatible with their language model pipelines. It would be worth investigating how to interoperate with that experimental flow, and also seeing whether I can leverage their approach to provide a Python wrapper for the llamapun abstractions as well. A sketch of the wrapper idea follows below.
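On the Python wrapper angle: huggingface's tokenizers exposes its Rust core to Python via PyO3, and the same pattern could apply here. A minimal sketch, assuming a hypothetical `extract_plain_text` entry point and module name (neither exists in llamapun today; exact PyO3 macros also vary by version):

```rust
use pyo3::prelude::*;

// Hypothetical: expose a llamapun-style plain-text extraction step to
// Python, mirroring how huggingface's tokenizers wraps its Rust core.
#[pyfunction]
fn extract_plain_text(html_path: &str) -> PyResult<String> {
    // A real implementation would call into llamapun's document APIs here.
    Ok(format!("plain text extracted from {}", html_path))
}

#[pymodule]
fn llamapun_py(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(extract_plain_text, m)?)?;
    Ok(())
}
```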

The right place for huggingface's tokenizers would be after we do our own math-aware preprocessing; it wouldn't really play any part in serializing a "token model" plain text file, which is the current endpoint of using llamapun. Once that plain text is read in by a specific modeling framework, it needs to be retokenized per the model's requirements (e.g. around 2 million distinct tokens if one uses GloVe/word2vec over arXiv, but only about 30 thousand tokens if one uses subword tokenization). So huggingface's tokenization is probably a step to apply after preprocessing via llamapun is done.
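As a concrete illustration of that ordering, here is a minimal sketch that reads a llamapun-produced plain text file and retokenizes it with the tokenizers crate. Both file names are placeholders, and the pretrained `tokenizer.json` (e.g. a BERT-style subword vocabulary) is an assumption:

```rust
use std::fs;
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Load a pretrained subword tokenizer (~30k vocabulary entries).
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // llamapun's endpoint: a preprocessed, math-aware plain text file,
    // one sentence per line.
    let corpus = fs::read_to_string("token_model.txt")?;
    for line in corpus.lines() {
        // Retokenize per the model's requirements, after preprocessing.
        let encoding = tokenizer.encode(line, false)?;
        println!("{:?}", encoding.get_tokens());
    }
    Ok(())
}
```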

And as things stand, the huggingface/keras ecosystem (and some of its competitors) is so convenient that llamapun should really act as a math-aware preprocessing library and leave the actual modeling to something else.