There are 8 repositories under the tokenizer topic.
Parser Building Toolkit for JavaScript
Persian NLP Toolkit
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).
Open Korean Text Processor - An Open-source Korean Text Processor
Online playground for OpenAI tokenizers
🌿 NodeJS PHP Parser - extract AST or tokens
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript
A SQLite3 FTS5 full-text search extension and tokenizer that supports Chinese and Pinyin
Python port of Moses tokenizer, truecaser and normalizer
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. The implementation is fully modular and can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
Lex machinery for Go.
JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4. Port of OpenAI's tiktoken with additional features.
A multilingual morphological analysis library.
VSCode extension to highlight nested code blocks
🎤 vibrato: Viterbi-based accelerated tokenizer
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer