There are 19 repositories under the tokenization topic.
LunaSec - Dependency Security Scanner that automatically notifies you about vulnerabilities like Log4Shell or node-ipc in your Pull Requests and Builds. Protect yourself in 30 seconds with the LunaTrace GitHub App: https://github.com/marketplace/lunatrace-by-lunasec/
Easy token price estimates for 400+ LLMs. TokenOps.
Secure Vault for Customer PII/PHI/PCI/KYC Records
Ravencoin Core integration/staging tree
Unsupervised text tokenizer focused on computational efficiency
👑 spaCy building blocks and visualizers for Streamlit apps
All the slides, accompanying code and exercises are stored in this repo. 🎈
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript
PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language
🎤 vibrato: Viterbi-based accelerated tokenizer
Sudachi in Rust 🦀 and the new generation of SudachiPy
CodeChain's official implementation in Rust.
TokenScript schema, specs and paper
OmniTokenizer: one model and one weight for image-video joint tokenization.
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
NLP Cheat Sheet, Python, spaCy, LexNLP, NLTK, tokenization, stemming, sentence detection, named entity recognition
This repository is a complete guide to natural language processing (NLP) in Python, covering various techniques for implementing NLP, including parsing and text processing, and showing how to use NLP for text feature engineering.
Minimal, OpenSSL-less and super lightweight JWT library written in C.
Implementation of the GBST block from the Charformer paper, in PyTorch
Code for Zero-Shot Tokenizer Transfer
Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.