There are 14 repositories under the tokenizer topic.
A small library for converting tokenized PHP source code into XML (and potentially other formats)
Parser Building Toolkit for JavaScript
Online playground for OpenAI tokenizers
Persian NLP Toolkit
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
A SQLite3 fts5 full-text search extension and tokenizer which supports Chinese and PinYin
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
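Data Annotation is a tool dedicated to processing and annotating text data. Through a simplified, fast annotation workflow and dynamic algorithmic feedback, it lets users label keywords quickly while algorithms continuously reduce the cost and time of manual annotation. The process starts with manual annotation to build a foundation, then automatic annotation feeds back into manual annotation, and finally manual annotation corrects errors, which substantially improves annotation accuracy and efficiency. Data Annotation relies on an open-source digital foundation platform for managing personnel and roles.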
🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT models (gpt-5, gpt-o*, gpt-4o, etc.). Port of OpenAI's tiktoken with additional features.
Open Korean Text Processor - An Open-source Korean Text Processor
Implement llama3 inference step by step: grasp the core concepts, master the derivation of the process, and implement the code.
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript
🌿 NodeJS PHP Parser - extract AST or tokens
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and developed in ANSI C. Fully modular in implementation and easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
Python port of Moses tokenizer, truecaser and normalizer
VSCode extension to highlight nested code blocks
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
Efficient and optimized tokenizer engine for LLM inference serving
[NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding
Lex machinery for Go.
🎤 vibrato: Viterbi-based accelerated tokenizer
Ready-made tokenizer library for working with GPT and tiktoken