daulet / tokenizers

Go bindings for HuggingFace Tokenizer

Tokenizers

Go bindings for the HuggingFace Tokenizers library.

Installation

Run make build to produce libtokenizers.a, the static library that any application using these bindings must link against.

Using pre-built binaries

Build your Go application using pre-built native binaries: docker build --platform=linux/amd64 -f example/Dockerfile .

Available binaries:

Getting started

TLDR: working example.

Load a tokenizer from a JSON config:

import "github.com/daulet/tokenizers"

tk, err := tokenizers.FromFile("./data/bert-base-uncased.json")
if err != nil {
    return err
}
// release native resources
defer tk.Close()

Encode text and decode tokens:

fmt.Println("Vocab size:", tk.VocabSize())
// Vocab size: 30522
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", false))
// [2829 4419 14523 2058 1996 13971 3899] [brown fox jumps over the lazy dog]
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", true))
// [101 2829 4419 14523 2058 1996 13971 3899 102] [[CLS] brown fox jumps over the lazy dog [SEP]]
fmt.Println(tk.Decode([]uint32{2829, 4419, 14523, 2058, 1996, 13971, 3899}, true))
// brown fox jumps over the lazy dog

Benchmarks

go test . -bench=. -benchmem -benchtime=10s

goos: darwin
goarch: arm64
pkg: github.com/daulet/tokenizers
BenchmarkEncodeNTimes-10        996556      11851 ns/op      116 B/op        6 allocs/op
BenchmarkEncodeNChars-10      1000000000      2.446 ns/op        0 B/op        0 allocs/op
BenchmarkDecodeNTimes-10       7286056       1657 ns/op      112 B/op        4 allocs/op
BenchmarkDecodeNTokens-10     65191378      211.0 ns/op        7 B/op        0 allocs/op
PASS
ok   github.com/daulet/tokenizers 126.681s

Contributing

Please refer to CONTRIBUTING.md for information on how to contribute a PR to this project.

About

License: MIT License


Languages

Go 52.8%, Rust 18.6%, Starlark 12.9%, Python 6.7%, Makefile 4.0%, Dockerfile 2.9%, C 2.2%