jacobaustin123 / tokenizer

BPE tokenization implemented in Golang πŸ’™


Cohere's tokenizers library provides an interface to encode and decode text given a computed vocabulary, and includes pre-computed tokenizers that are used to train Cohere's models.

We also plan to eventually open source tools for creating new tokenizers.

Example using Go

Choose a tokenizer from the vocab folder, each of which includes both an encoder.json file and a vocab.bpe file, and create an encoder as shown below. This example uses the coheretext-50k tokenizer.

import (
  ...
  "github.com/cohere-ai/tokenizer"
)

encoder := tokenizer.NewFromPrebuilt("coheretext-50k")

To encode a string of text, use the Encode method. Encode returns a slice of int64s.

encoded := encoder.Encode("this is a string to be encoded")
fmt.Printf("%v", encoded)
// [6372 329 258 3852 288 345 37754]

To decode a slice of int64s, use the Decode method. Decode returns a string.

fmt.Println(encoder.Decode(encoded))
// this is a string to be encoded
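Under the hood, a BPE vocabulary like the one in vocab.bpe is a list of learned character-pair merges applied in priority order. As a rough illustration only (the merge rules below are made up and this is not this library's implementation), a toy sketch of how such merges are applied at encode time:

```go
package main

import (
	"fmt"
	"strings"
)

// applyMerge replaces every adjacent pair (a, b) in the token
// sequence with the merged token a+b, in a single left-to-right pass.
func applyMerge(tokens []string, a, b string) []string {
	var out []string
	for i := 0; i < len(tokens); i++ {
		if i+1 < len(tokens) && tokens[i] == a && tokens[i+1] == b {
			out = append(out, a+b)
			i++ // skip the second half of the merged pair
		} else {
			out = append(out, tokens[i])
		}
	}
	return out
}

// encodeWord splits a word into single characters, then applies the
// merge rules in priority order, mimicking how a BPE vocabulary is
// consumed at encode time.
func encodeWord(word string, merges [][2]string) []string {
	tokens := strings.Split(word, "")
	for _, m := range merges {
		tokens = applyMerge(tokens, m[0], m[1])
	}
	return tokens
}

func main() {
	// Hypothetical merge list; a real vocab.bpe holds tens of
	// thousands of learned pairs ordered by frequency.
	merges := [][2]string{{"l", "o"}, {"lo", "w"}, {"e", "r"}}
	fmt.Println(encodeWord("lower", merges)) // [low er]
}
```

A real encoder then maps each resulting token to its integer id via encoder.json, which is how Encode produces the int64 slice shown above.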

Speed

Using a 2.5GHz CPU, encoding 1000 tokens takes approximately 6.5 milliseconds, and decoding 1000 tokens takes approximately 0.2 milliseconds.
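As a rough sketch of how such timings could be reproduced, one option is time.Since around the encode call. The encodeStandIn function below is a hypothetical placeholder, not part of this library; substitute a real encoder to measure actual throughput:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// encodeStandIn is a hypothetical stand-in for an Encode call,
// mapping each whitespace-separated word to a dummy int64 id.
func encodeStandIn(text string) []int64 {
	words := strings.Fields(text)
	ids := make([]int64, len(words))
	for i := range words {
		ids[i] = int64(len(words[i]))
	}
	return ids
}

func main() {
	// Build a large input so the timing is not dominated by noise.
	text := strings.Repeat("this is a string to be encoded ", 200)
	start := time.Now()
	ids := encodeStandIn(text)
	elapsed := time.Since(start)
	fmt.Printf("encoded %d tokens in %v\n", len(ids), elapsed)
}
```

For more rigorous numbers, Go's built-in testing package benchmarks (go test -bench) average over many iterations automatically.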

About


License: Apache License 2.0


Languages

Language: Go 100.0%