daac-tools's repositories
find-simdoc
Finding all pairs of similar documents time- and memory-efficiently
python-vibrato
Viterbi-based accelerated tokenizer (Python wrapper)
trie-match
Fast match expression optimized for string comparison
python-vaporetto
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
python-daachorse
🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure. (Python wrapper for daachorse)
include-bytes-zstd
Includes a file with zstd compression in Rust
guidelines
Guidelines for the daac-tools community
vaporetto-models
Tokenization models and training scripts for the Vaporetto fast tokenizer