Beast code in Giters

distilabel

⚗️ distilabel is a framework for synthetic data and AI feedback for AI engineers who require high-quality outputs, full data ownership, and overall efficiency.

Language: Python · License: Apache-2.0 · 122000

Awesome-Knowledge-Distillation-of-LLMs

This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & Vertical Distillation of LLMs.

43700

deepeval

The LLM Evaluation Framework

Language: Python · License: Apache-2.0 · 256700

text-clustering

Easily embed, cluster and semantically label text datasets

Language: Python · License: Apache-2.0 · 40400

llm-datasets

High-quality datasets, tools, and concepts for LLM fine-tuning.

114900

llm-data-creation

Model, Code & Data for the EMNLP'23 paper "Making Large Language Models Better Data Creators"

Language: Python · License: MIT · 10300

AttrPrompt

[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.

Language: Python · License: Apache-2.0 · 12900

pyvene

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions

Language: Python · License: Apache-2.0 · 56400

deduplicate-text-datasets

Language: Rust · License: Apache-2.0 · 106400

RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Language: Python · License: Apache-2.0 · 446900

MAP-NEO

Language: Python · 79300

data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷 Providing higher-quality, richer, and more "digestible" data for large models!

Language: Python · License: Apache-2.0 · 185800

FinNLP

Democratizing Internet-scale financial data.

Language: Jupyter Notebook · License: MIT · 108400

LLMDataHub

A quick guide (especially) for trending instruction finetuning datasets

License: MIT · 230100

awesome-instruction-datasets

A collection of awesome-prompt-datasets and awesome-instruction-dataset resources for training ChatLLM models such as ChatGPT. It gathers a wide variety of instruction datasets for training ChatLLM models.

License: Apache-2.0 · 46200

IEPile

[OneKE] [ACL 2024] IEPile: A Large-Scale Information Extraction Corpus

Language: Python · License: NOASSERTION · 13900

InstructUIE

Universal information extraction with instruction learning

Language: Python · License: MIT · 35500

OpenNRE

An Open-Source Package for Neural Relation Extraction (NRE)

Language: Python · License: MIT · 428900

Evaluation-of-ChatGPT-on-Information-Extraction

An evaluation of ChatGPT on information extraction tasks, including Named Entity Recognition (NER), Relation Extraction (RE), Event Extraction (EE), and Aspect-based Sentiment Analysis (ABSA).

Language: Python · 12000

KnowLM

An Open-source Knowledgeable Large Language Model Framework.

Language: Python · License: MIT · 115200

ChatIE

The online version is temporarily unavailable because we cannot afford the key. You can clone the repository and run it locally. Note: we set a default OpenAI key. If the key exceeds its plan and becomes invalid, please tell us. The response speed depends on OpenAI (sometimes the official service is too crowded and slow).

Language: Python · License: NOASSERTION · 76800

lm-evaluation-harness

A framework for few-shot evaluation of language models.

Language: Python · License: MIT · 601600

llama-recipes

Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A. Supports a number of candidate inference solutions, such as HF TGI and vLLM, for local or cloud deployment. Includes demo apps that showcase Meta Llama3 for WhatsApp & Messenger.

Language: Jupyter Notebook · 1107200