Repositories under the tokenizers topic:
Building applications with LLMs through composability, in Kotlin
Implementation of the LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper
This repository is part of a course on Elasticsearch in Python. It includes notebooks that demonstrate its usage, along with a YouTube series to guide you through the material.
Develop DL models using Pytorch and Hugging Face
the small distributed language model toolkit; fine-tune state-of-the-art LLMs anywhere, rapidly
This project shows how to derive the total number of training tokens from a large text dataset from 🤗 datasets with Apache Beam and Dataflow.
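The core idea of that counting pipeline can be sketched without Beam: map each document to its token count, then sum the counts. The repository itself runs this at scale with Apache Beam on Dataflow and a real tokenizer; the whitespace split below is a stand-in for illustration.

```python
# Minimal map-reduce sketch of total-token counting.
# Map step: tokenize each document and emit its token count.
# Reduce step: sum the counts across the corpus.
# The whitespace split is a stand-in for a real tokenizer.

def count_tokens(doc: str) -> int:
    """Token count for one document (whitespace split as a stand-in)."""
    return len(doc.split())

def total_training_tokens(corpus) -> int:
    """Sum per-document token counts across the whole corpus."""
    return sum(count_tokens(doc) for doc in corpus)

corpus = [
    "the quick brown fox",
    "jumps over the lazy dog",
]
print(total_training_tokens(corpus))  # 9
```

In the Beam version, the map step becomes a `ParDo` over dataset shards and the reduce step a combiner, but the arithmetic is the same.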
Python script for manipulating the existing tokenizer.
Use custom tokenizers in spacy-transformers
[Unofficial] Simple .NET wrapper of HuggingFace Tokenizers library
Package to align tokens from different tokenizations.
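Token alignment across tokenizations is commonly done by mapping each token back to its character span in the original text and pairing tokens whose spans overlap. A minimal sketch of that span-overlap approach (not the package's actual API):

```python
def char_spans(tokens, text):
    """Map each token to its (start, end) character span in text."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def align(tokens_a, tokens_b, text):
    """Pair indices of tokens from two tokenizations whose spans overlap."""
    spans_a = char_spans(tokens_a, text)
    spans_b = char_spans(tokens_b, text)
    pairs = []
    for i, (sa, ea) in enumerate(spans_a):
        for j, (sb, eb) in enumerate(spans_b):
            if sa < eb and sb < ea:  # half-open intervals overlap
                pairs.append((i, j))
    return pairs

text = "tokenizers rock"
a = ["tokenizers", "rock"]       # word-level tokenization
b = ["token", "izers", "rock"]   # subword-level tokenization
print(align(a, b, text))  # [(0, 0), (0, 1), (1, 2)]
```

This sketch assumes every token appears verbatim in the text; real aligners also handle normalization and special tokens.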
Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels
Small library that provides functions to tokenize a string into an array of words with or without punctuation
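That kind of word tokenizer, with punctuation optionally kept as separate tokens, is a few lines of regex in Python (a generic sketch, not the library's own code):

```python
import re

def tokenize(text: str, keep_punctuation: bool = True) -> list:
    """Split text into word tokens, optionally keeping punctuation marks
    as separate tokens."""
    if keep_punctuation:
        # \w+ matches word runs; [^\w\s] matches single punctuation chars.
        return re.findall(r"\w+|[^\w\s]", text)
    return re.findall(r"\w+", text)

print(tokenize("Hello, world!"))                          # ['Hello', ',', 'world', '!']
print(tokenize("Hello, world!", keep_punctuation=False))  # ['Hello', 'world']
```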
A graphical user interface for the Elasticsearch Analyze API
Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.
Visualize some important concepts related to LLM architectures.
A high-performance tokenizer built to rival GPT-4, trained on the C4 dataset.
Byte Pair Encoding (BPE) tokenizer tailored for the Turkish language
ML Model designed to learn compositional structure of LEGO assemblies
Self-containing notebooks to play simply with some particular concepts in Deep Learning
Fine-tuning pre-trained transformer models in TensorFlow and PyTorch for question answering
Explore how Hugging Face tokenizers work across models like LLaMA, PHI-3, and StarCoder2. Includes examples for encoding, decoding, chat formatting, and token visualization. Ideal for understanding text preprocessing in LLMs.
Question-and-answer web application using fine-tuned and pre-trained T5 models. The application runs on Streamlit.
Recreating every milestone in Machine Learning and Artificial Intelligence
Create prompts with a given token length for testing LLMs and other transformers text models.
Fast tokenizers for Emacs Lisp backed by Hugging Face's Rust library
Kingchop ⚔️ is a JavaScript English-based library for tokenizing text (chopping text). It uses an extensive rule set for tokenizing, which you can adjust easily.
Optimized implementation of the Byte-Pair Encoding (BPE) algorithm that can process billions of words in a few minutes on a medium-resource computer
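The BPE training loop that repos like these optimize is simple at its core: count adjacent symbol pairs across the vocabulary, merge the most frequent pair into one symbol, and repeat. A minimal, unoptimized sketch (the listed projects add the data structures that make it fast):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the (symbols -> frequency) vocab
    and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = pair[0] + pair[1]
    new_words = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_words[tuple(out)] = freq
    return new_words

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    words = Counter(tuple(w) for w in corpus)  # words as character tuples
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

print(learn_bpe(["low", "lower", "lowest"], 2))  # [('l', 'o'), ('lo', 'w')]
```

The optimized variants avoid rescanning the whole vocabulary per merge by incrementally updating pair counts, which is where the speedup on billions of words comes from.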
A comprehensive, educational project dedicated to building a Large Language Model (LLM) from the ground up. It serves as the official code repository for the book Build a Large Language Model (From Scratch), guiding developers step by step through developing, pretraining, finetuning, and aligning a GPT-like LLM using PyTorch.
🖋️ A sleek, BPE-powered tokenizer that understands the richness of Marathi.
a vector database + embedding model written from scratch in go