Tokenizers

Under development. This library is NOT ready for production use, and the author does not have enough time or motivation to continue developing it, at least for now.

A Swift binding for the HuggingFace Tokenizers library.

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

How to run

This package has not been published yet. You can try it in the Swift REPL.

First, build the FFI library and generate the Swift scaffolding files:

$ make release

You can also build the debug target by running make build, but a debug build is too slow to train a model.

Then start the Swift REPL with a linker option that points to the built library:

$ swift run --repl -Xlinker="-Ltarget/release"

Quick Example

Examples can be found under the example directory. You can run each one with the swift run command:

$ make release
$ cd example 
$ swift run -Xlinker -L../target/release PretrainedTokenizerExample
Building for debugging...
[2/2] Compiling PretrainedTokenizerExample Example.swift
Build complete! (0.36s)
tokens = ["[CLS]", "Hey", "there", "!", "[SEP]"]

Loading a pretrained tokenizer from the Hub

import Tokenizers

@main
public struct Example {
    public static func main() {
        let tokenizer = try! Tokenizer(pretrained: "bert-base-cased")
        let encoding = try! tokenizer.encode("Hey there!")

        print("tokens = \(encoding.tokens)")
    }
}

Training and serialization

As in the original quick tour, this section shows how to build a tokenizer from scratch using tokenizers-swift.

In this example, you can train a new tokenizer on wikitext-103 in just a few seconds (hopefully). First things first, you will need to download this dataset and unzip it with:

mkdir data
cd data
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
cd ../example

Now you can train the tokenizer:

$ swift run -Xlinker -L../target/release TrainingTokenizerExample
[00:00:09] Pre-processing files (543 Mo)            ██████████████████████████████████████████████████████████████████████████████████████████████████████████                100%
[00:00:00] Tokenize words                           ██████████████████████████████████████████████████████████████████████████████████████████████████████████ 610142   /   610142
[00:00:03] Count pairs                              ██████████████████████████████████████████████████████████████████████████████████████████████████████████ 610142   /   610142
[00:00:02] Compute merges                           ██████████████████████████████████████████████████████████████████████████████████████████████████████████ 24989    /    24989

Reload the trained tokenizer...
["Hello", ",", "y", "\'", "all", "!", "How", "are", "you", "[UNK]", "?"]
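The "Count pairs" and "Compute merges" stages in the log above appear to be the core of BPE training. As a rough illustration only, here is a plain-Swift sketch of that loop on a toy corpus. It does not use this library, and the Word and Pair types, the countPairs, merge, and learnMerges functions, and the corpus itself are all hypothetical names made up for this sketch. The idea is to count adjacent symbol pairs, merge the most frequent one, and repeat:

```swift
// Illustrative sketch of BPE merge learning; NOT this library's implementation.
// All type and function names here are hypothetical.
struct Word {
    var symbols: [String]  // current segmentation of the word
    let count: Int         // how often the word occurs in the corpus
}

struct Pair: Hashable {
    let left: String
    let right: String
}

// Count adjacent symbol pairs, weighted by word frequency.
func countPairs(_ words: [Word]) -> [Pair: Int] {
    var counts: [Pair: Int] = [:]
    for word in words {
        for (left, right) in zip(word.symbols, word.symbols.dropFirst()) {
            counts[Pair(left: left, right: right), default: 0] += word.count
        }
    }
    return counts
}

// Replace every occurrence of `pair` in `symbols` with the merged symbol.
func merge(_ pair: Pair, in symbols: [String]) -> [String] {
    var result: [String] = []
    var i = 0
    while i < symbols.count {
        if i + 1 < symbols.count, symbols[i] == pair.left, symbols[i + 1] == pair.right {
            result.append(pair.left + pair.right)
            i += 2
        } else {
            result.append(symbols[i])
            i += 1
        }
    }
    return result
}

// Learn up to `steps` merges, always picking the most frequent pair
// (ties are broken alphabetically so the result is deterministic).
func learnMerges(_ corpus: [Word], steps: Int) -> [Pair] {
    var words = corpus
    var merges: [Pair] = []
    for _ in 0..<steps {
        let ranked = countPairs(words).sorted { a, b in
            a.value != b.value
                ? a.value > b.value
                : (a.key.left, a.key.right) < (b.key.left, b.key.right)
        }
        guard let best = ranked.first?.key else { break }
        merges.append(best)
        words = words.map { Word(symbols: merge(best, in: $0.symbols), count: $0.count) }
    }
    return merges
}

// Toy corpus: words split into characters, with made-up frequencies.
let corpus = [
    Word(symbols: ["h", "u", "g"], count: 10),
    Word(symbols: ["p", "u", "g"], count: 5),
    Word(symbols: ["p", "u", "n"], count: 12),
    Word(symbols: ["b", "u", "n"], count: 4),
    Word(symbols: ["h", "u", "g", "s"], count: 5),
]

let merges = learnMerges(corpus, steps: 3)
print(merges.map { "\($0.left)+\($0.right)" })
// prints ["u+g", "u+n", "h+ug"]
```

A real trainer also handles special tokens, a target vocabulary size, and far more efficient pair bookkeeping. The sketch only shows why the number of merges, not the number of words, determines the learned vocabulary.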

The source code for this example is shown below.

import Tokenizers

@main
public struct TrainingTokenizerExample {
    public static func main() throws {
        let tokenizer = try Tokenizer(model: BPE(unkToken: "[UNK]"))
        let trainer = try BPETrainer(specialTokens: [
            "[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]",
        ])

        tokenizer.preTokenizer = Whitespace()

        // Train the tokenizer.
        let files = ["test", "train", "valid"].map { "../data/wikitext-103-raw/wiki.\($0).raw" }
        try tokenizer.train(files: files, trainer: trainer)

        // Save the tokenizer in one file.
        try tokenizer.save(to: "../data/tokenizer-wiki.json")

        // Reload the tokenizer
        print("Reload the trained tokenizer...")
        let savedTokenizer = try Tokenizer(contentsOfFile: "../data/tokenizer-wiki.json")

        // Using the tokenizer
        let output = try savedTokenizer.encode("Hello, y'all! How are you 😁 ?")

        print(output.tokens)
        // => ["Hello", ",", "y", "\'", "all", "!", "How", "are", "you", "[UNK]", "?"]
    }
}
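The example sets a Whitespace pre-tokenizer, which is why "y'all" comes out as three tokens. The upstream HuggingFace Tokenizers documentation describes Whitespace as splitting with the regular expression \w+|[^\w\s]+. The sketch below mimics that behaviour with Foundation's NSRegularExpression. It is an approximation for illustration, not this library's code, and whitespacePreTokenize is a hypothetical helper name:

```swift
import Foundation

// Illustrative approximation of the Whitespace pre-tokenizer, assuming the
// upstream-documented pattern \w+|[^\w\s]+ ; NOT this library's implementation.
func whitespacePreTokenize(_ text: String) -> [String] {
    // Runs of word characters, or runs of characters that are neither
    // word characters nor whitespace (punctuation, symbols, emoji).
    let regex = try! NSRegularExpression(pattern: #"\w+|[^\w\s]+"#, options: [])
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        // Convert the UTF-16 based NSRange back into a Swift String range.
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

let pieces = whitespacePreTokenize("Hello, y'all! How are you 😁 ?")
print(pieces)
```

Under this pattern the emoji survives pre-tokenization as its own piece. In the trained tokenizer's output above it shows up as [UNK], which suggests the model has no vocabulary entry for it.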
