Repositories under the tokenizers topic:
Building applications with LLMs through composability, in Kotlin
Implementation of the LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper
This repository is part of a course on Elasticsearch in Python. It includes notebooks that demonstrate its usage, along with a YouTube series to guide you through the material.
Develop DL models using Pytorch and Hugging Face
the small distributed language model toolkit; fine-tune state-of-the-art LLMs anywhere, rapidly
This project shows how to derive the total number of training tokens from a large text dataset from 🤗 datasets with Apache Beam and Dataflow.
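The core idea of that counting pipeline can be sketched without Beam: map each document to its token count, then sum the counts. The repository itself runs this at scale with Apache Beam on Dataflow and a real tokenizer; the whitespace split below is a stand-in for illustration.

```python
# Minimal map-reduce sketch of total-token counting.
# Map step: tokenize each document and emit its token count.
# Reduce step: sum the counts across the corpus.
# The whitespace split is a stand-in for a real tokenizer.

def count_tokens(doc: str) -> int:
    """Token count for one document (whitespace split as a stand-in)."""
    return len(doc.split())

def total_training_tokens(corpus) -> int:
    """Sum per-document token counts across the whole corpus."""
    return sum(count_tokens(doc) for doc in corpus)

corpus = [
    "the quick brown fox",
    "jumps over the lazy dog",
]
print(total_training_tokens(corpus))  # 9
```

In the Beam version, the map step becomes a `ParDo` over dataset shards and the reduce step a combiner, but the arithmetic is the same.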
Python script for manipulating the existing tokenizer.
Use custom tokenizers in spacy-transformers
[Unofficial] Simple .NET wrapper of HuggingFace Tokenizers library
Package to align tokens from different tokenizations.
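Token alignment across tokenizations is commonly done by mapping each token back to its character span in the original text and pairing tokens whose spans overlap. A minimal sketch of that span-overlap approach (not the package's actual API):

```python
def char_spans(tokens, text):
    """Map each token to its (start, end) character span in text."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def align(tokens_a, tokens_b, text):
    """Pair indices of tokens from two tokenizations whose spans overlap."""
    spans_a = char_spans(tokens_a, text)
    spans_b = char_spans(tokens_b, text)
    pairs = []
    for i, (sa, ea) in enumerate(spans_a):
        for j, (sb, eb) in enumerate(spans_b):
            if sa < eb and sb < ea:  # half-open intervals overlap
                pairs.append((i, j))
    return pairs

text = "tokenizers rock"
a = ["tokenizers", "rock"]       # word-level tokenization
b = ["token", "izers", "rock"]   # subword-level tokenization
print(align(a, b, text))  # [(0, 0), (0, 1), (1, 2)]
```

This sketch assumes every token appears verbatim in the text; real aligners also handle normalization and special tokens.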
Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels
Small library that provides functions to tokenize a string into an array of words with or without punctuation
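That kind of word tokenizer, with punctuation optionally kept as separate tokens, is a few lines of regex in Python (a generic sketch, not the library's own code):

```python
import re

def tokenize(text: str, keep_punctuation: bool = True) -> list:
    """Split text into word tokens, optionally keeping punctuation marks
    as separate tokens."""
    if keep_punctuation:
        # \w+ matches word runs; [^\w\s] matches single punctuation chars.
        return re.findall(r"\w+|[^\w\s]", text)
    return re.findall(r"\w+", text)

print(tokenize("Hello, world!"))                          # ['Hello', ',', 'world', '!']
print(tokenize("Hello, world!", keep_punctuation=False))  # ['Hello', 'world']
```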
A graphical user interface for the Elasticsearch Analyze API
Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.
Visualize some important concepts related to LLM architectures.
A high-performance tokenizer built to rival GPT-4, trained on the C4 dataset.
Byte Pair Encoding (BPE) tokenizer tailored for the Turkish language
ML Model designed to learn compositional structure of LEGO assemblies
Self-containing notebooks to play simply with some particular concepts in Deep Learning
Fine-tuning pre-trained transformer models in TensorFlow and PyTorch for question answering
Explore how Hugging Face tokenizers work across models like LLaMA, PHI-3, and StarCoder2. Includes examples for encoding, decoding, chat formatting, and token visualization. Ideal for understanding text preprocessing in LLMs.
Question-and-answer web application using fine-tuned and pre-trained T5 models. The application runs on Streamlit.
Recreating every milestone in Machine Learning and Artificial Intelligence
Create prompts with a given token length for testing LLMs and other transformers text models.
Fast tokenizers for Emacs Lisp backed by Hugging Face's Rust library
Kingchop ⚔️ is a JavaScript English-based library for tokenizing text (chopping text). It uses an extensive rule set for tokenizing, which you can adjust easily.
Optimized implementation of the Byte-Pair Encoding (BPE) algorithm that can process billions of words in a few minutes on a medium-resource computer
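The BPE training loop that repos like these optimize is simple at its core: count adjacent symbol pairs across the vocabulary, merge the most frequent pair into one symbol, and repeat. A minimal, unoptimized sketch (the listed projects add the data structures that make it fast):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the (symbols -> frequency) vocab
    and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = pair[0] + pair[1]
    new_words = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_words[tuple(out)] = freq
    return new_words

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    words = Counter(tuple(w) for w in corpus)  # words as character tuples
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

print(learn_bpe(["low", "lower", "lowest"], 2))  # [('l', 'o'), ('lo', 'w')]
```

The optimized variants avoid rescanning the whole vocabulary per merge by incrementally updating pair counts, which is where the speedup on billions of words comes from.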
A comprehensive, educational project dedicated to building a Large Language Model (LLM) from the ground up. It serves as the official code repository for the book Build a Large Language Model (From Scratch), guiding developers step by step through developing, pretraining, finetuning, and aligning a GPT-like LLM using PyTorch.
🖋️ A sleek, BPE-powered tokenizer that understands the richness of Marathi.
a vector database + embedding model written from scratch in go