Chunlin Wang's starred repositories

MNBVC

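MNBVC (Massive Never-ending BT Vast Chinese corpus) is an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, even "Martian script" (火星文) data. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.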

License: MIT | Stargazers: 3265 | Issues: 0

Alpaca-CoT

We unified the interfaces of instruction-tuning data (e.g., CoT data), multiple LLMs, and parameter-efficient methods (e.g., LoRA, P-tuning) for easy use. We have built a fine-tuning platform for large models that is easy for researchers to pick up and use. We welcome open-source enthusiasts to open any meaningful PR on this repo and to integrate as many LLM-related technologies as possible.

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 2536 | Issues: 0

datachain

Datachain is a peer-to-peer blockchain that powers the Betcoin cryptocurrency.

Language: C++ | License: MIT | Stargazers: 6 | Issues: 0

ChatEval

Code for our paper "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate"

Language: Python | License: Apache-2.0 | Stargazers: 216 | Issues: 0

HanLP

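Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and keyphrase extraction, automatic summarization, text classification and clustering, pinyin and Simplified/Traditional Chinese conversion, natural language processing.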

Language: Python | License: Apache-2.0 | Stargazers: 33193 | Issues: 0

MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.

Language: Python | License: AGPL-3.0 | Stargazers: 5246 | Issues: 0

MathPile

Generative AI for Math: MathPile

Language: Python | License: Apache-2.0 | Stargazers: 364 | Issues: 0

CLUEDatasetSearch

Search all Chinese NLP datasets, with commonly used English NLP datasets included.

Language: Python | Stargazers: 4040 | Issues: 0

LLMDataHub

A quick guide (especially) for trending instruction finetuning datasets

License: MIT | Stargazers: 2299 | Issues: 0

Explore-Instruct

EMNLP'2023: Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration

Language: Python | License: Apache-2.0 | Stargazers: 31 | Issues: 0

awesome-synthetic-datasets

awesome synthetic (text) datasets

Language: Jupyter Notebook | License: CC-BY-SA-4.0 | Stargazers: 193 | Issues: 0

ambrosia

clean up your LLM datasets

Language: Go | License: MIT | Stargazers: 113 | Issues: 0

data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Language: Python | License: Apache-2.0 | Stargazers: 1850 | Issues: 0

llm-data-creation

Model, Code & Data for the EMNLP'23 paper "Making Large Language Models Better Data Creators"

Language: Python | License: MIT | Stargazers: 103 | Issues: 0

llama3-from-scratch

llama3 implementation one matrix multiplication at a time

Language: Jupyter Notebook | License: MIT | Stargazers: 11623 | Issues: 0

minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

Language: Python | License: MIT | Stargazers: 8825 | Issues: 0

sft_datasets

A collection of open-source SFT datasets, updated on an ongoing basis.

Stargazers: 397 | Issues: 0

timesfm

TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model developed by Google Research for time-series forecasting.

Language: Python | License: Apache-2.0 | Stargazers: 3307 | Issues: 0

LLaMA-Factory

A WebUI for Efficient Fine-Tuning of 100+ LLMs (ACL 2024)

Language: Python | License: Apache-2.0 | Stargazers: 28165 | Issues: 0

Awesome-Chinese-LLM

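A collection of open-source Chinese large language models, focusing on models that are smaller in scale, can be privately deployed, and are cheaper to train. It covers base models, vertical-domain fine-tuning and applications, datasets, and tutorials.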

Stargazers: 13837 | Issues: 0

Language: Python | License: Apache-2.0 | Stargazers: 43 | Issues: 0

ChineseNlpCorpus

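Collecting, curating, and publishing Chinese natural language processing corpora and datasets, to advance Chinese NLP together with like-minded contributors.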

Language: Jupyter Notebook | Stargazers: 5736 | Issues: 0

Corpus

Datasets

Stargazers: 8 | Issues: 0

AI-For-Beginners

12 Weeks, 24 Lessons, AI for All!

Language: Jupyter Notebook | License: MIT | Stargazers: 33496 | Issues: 0

generative-ai-for-beginners

18 Lessons, Get Started Building with Generative AI 🔗 https://microsoft.github.io/generative-ai-for-beginners/

Language: Jupyter Notebook | License: MIT | Stargazers: 57788 | Issues: 0

phidata

Build AI Assistants with memory, knowledge and tools.

Language: Python | License: MPL-2.0 | Stargazers: 10814 | Issues: 0

AISuperDomain

Aila (AI超元域): The premier AI integration tool for Windows, macOS, and Android. Ask once, get answers from 10+ AIs like ChatGPT, Gemini, Claude3, Copilot, Poe, Perplexity, and more. Features customizable AI and prompts.

Language: C# | License: MIT | Stargazers: 575 | Issues: 0

DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

License: MIT | Stargazers: 3172 | Issues: 0

playground

Play with neural networks!

Language: TypeScript | License: Apache-2.0 | Stargazers: 11875 | Issues: 0

Awesome-Causal-RL

A curated list of causal reinforcement learning resources.

License: Apache-2.0 | Stargazers: 27 | Issues: 0