gomlx / tokenizers

Tokenizers for Language Models - Go API for HuggingFace Tokenizers

Tokenizers for Go

Highlights

Important

TODO: nothing implemented yet.

  • Allow customization for various LLMs, exposing most of the functionality of the HuggingFace Tokenizers library.
  • Provide a from_pretrained API that downloads the parameters of various known models, leveraging the HuggingFace Hub.

Installation

This library is a wrapper around the Rust implementation by HuggingFace, and it requires the compiled Rust code to be available as the static library libgomlx_tokenizers.a.

To make that easy, the project provides a prebuilt libgomlx_tokenizers.a in the git repository for the popular platforms. For many users nothing more is needed, except having CGO enabled (for cross-compilation, set CGO_ENABLED=1), and the library can be imported like any other Go library.
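
Since the only stated prerequisite is having CGO enabled, a quick way to check your environment is the small program below. It is a minimal sketch that uses only the Go standard library (go/build); the helper name cgoHint is hypothetical and not part of this project. Run it with go run in the same environment, and with the same CGO_ENABLED setting, that you use to build.

```go
package main

import (
	"fmt"
	"go/build"
)

// cgoHint returns a short message for the given cgo setting.
// Hypothetical helper for this example; it is not part of the library.
func cgoHint(enabled bool) string {
	if enabled {
		return "cgo is enabled"
	}
	return "cgo is disabled: set CGO_ENABLED=1 and make sure a C compiler is installed"
}

func main() {
	// build.Default.CgoEnabled follows the CGO_ENABLED environment variable,
	// falling back to the platform default when the variable is unset.
	fmt.Println(cgoHint(build.Default.CgoEnabled))
}
```

Running go env CGO_ENABLED on the command line reports the same setting without writing any code.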

If you want to build the underlying Rust wrapper and its dependencies yourself for any reason (for example, to add support for a different platform), the project uses the Mage build system, a Makefile-like tool whose build rules are written in Go.

If you create a new rule for a different platform, please consider contributing it back 😄

Important

TODO

Thank You

Questions

Why fork and not collaborate with an already existing tokenizers project?

I plan to revamp how the library is organized, make its "ergonomics" more aligned with the GoMLX APIs, and add documentation. I will also expand the functionality to match HuggingFace's library as far as I am able. All this will completely break the API of the original repositories, and I felt that was too much to ask of the original authors.

About

License: MIT License

