Takumi Ito's starred repositories
persona-hub
Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas"
instructor
Structured outputs for LLMs
distilabel
distilabel is a framework for synthetic data and AI feedback for AI engineers who require high-quality outputs, full data ownership, and overall efficiency.
LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
LLMDataHub
A quick guide to trending instruction fine-tuning datasets
langkit
LangKit: an open-source toolkit for monitoring Large Language Models (LLMs). Extracts signals from prompts & responses to help ensure safety & security. Features include text quality, relevance metrics, & sentiment analysis. A comprehensive tool for LLM observability.
text-dedup
All-in-one text de-duplication
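As a rough illustration of the exact-duplicate removal that tools like text-dedup automate (alongside fuzzier methods such as MinHash), here is a minimal sketch using normalized content hashing; the function name and normalization choices are illustrative assumptions, not text-dedup's API:

```python
import hashlib

def dedup_exact(docs):
    """Keep the first occurrence of each document, treating texts that
    differ only in case or whitespace as duplicates. Illustrative only,
    not text-dedup's actual interface."""
    seen = set()
    unique = []
    for doc in docs:
        # Normalize whitespace and case, then hash for O(1) membership checks.
        normalized = " ".join(doc.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# dedup_exact(["Hello  world", "hello world", "bye"]) keeps only the
# first "Hello  world" plus "bye".
```

Real corpus deduplication additionally handles near-duplicates (e.g., via MinHash/LSH or suffix arrays), which this exact-match sketch does not attempt.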
preprocess
Corpus preprocessing
AlignScore
ACL 2023 - AlignScore, a metric for factual consistency evaluation.
J-UniMorph
A UniMorph dataset for Japanese
DataDreamer
DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models.
TransformerLens
A library for mechanistic interpretability of GPT-style language models
uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and gives insights on how to resolve them.