open source ChatGPT and beyond

On the road to implementing open-source ChatGPT-like models and beyond.

Since the accidental leak of the LLaMA model weights and the impressive performance of Stanford Alpaca, which was trained on LLaMA using data generated by the GPT-3 API with the self-instruct technique, the open-source community has been excited about the promising future of reproducing ChatGPT in an open way.

This repo aims at recording this process, and providing an overview of how to get involved.

It covers base models, technologies, data, domain models, training pipelines, speed-up techniques, multi-language and multi-modal work, with more to come.

Thanks to @FunnySaltyFish for the website version and its code.

Any contribution to this project and the website is appreciated! (we are short of hands ...)

Base Models

contributor	model/project	multi-modal	license	language	main feature
Meta	LLaMA	✖		en	LLaMA-13B outperforms GPT-3 (175B) and LLaMA-65B is competitive with PaLM-540B. Base model for most follow-up works.
THU	ChatGLM-6B	✖		en/zh	well-known Chinese chat model that can run on a single GPU.
HuggingFace-BigScience	BLOOM	✖		multi	an autoregressive Large Language Model (LLM) trained by HuggingFace BigScience.
HuggingFace-BigScience	BLOOMZ	✖		multi	instruction-finetuned version of BLOOM & mT5 pretrained multilingual language models on crosslingual task mixture.
EleutherAI	GPT-J	✖		en	transformer model trained using Ben Wang's Mesh Transformer JAX.
Meta	OPT	✖		en	Open Pre-trained Transformer language models. The aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs.
Cerebras Systems	Cerebras-GPT	✖		en	Pretrained, GPT-3-like LLM that is commercially available, efficiently trained on the Andromeda AI supercomputer in accordance with the Chinchilla scaling laws (20 tokens per model parameter), which is compute-optimal.
EleutherAI	pythia	✖		en	combine interpretability analysis and scaling laws to understand how knowledge develops and evolves during training in autoregressive transformers.
Stability-AI	StableLM	✖		en	Stability AI Language Models
FDU	MOSS	✖		en/zh	An open-source tool-augmented conversational language model from Fudan University.
ssymmetry & FDU	BBT-2	✖		zh	12B open-source LM.
@mlfoundations	OpenFlamingo	✅		en	An open-source framework for training large multimodal models.
EleutherAI	GPT-NeoX-20B	✖		en	Its architecture intentionally resembles that of GPT-3, and is almost identical to that of GPT-J-6B.
UCB	OpenLLaMA	✖	Apache-2.0	en	An Open Reproduction of LLaMA.
MosaicML	MPT	✖	Apache-2.0	en	MPT-7B is a GPT-style model, and the first in the MosaicML Foundation Series of models. Trained on 1T tokens of a MosaicML-curated dataset, MPT-7B is open-source, commercially usable, and equivalent to LLaMA-7B on evaluation metrics.
TogetherComputer	RedPajama-INCITE-Base-3B-v1	✖	Apache-2.0	en	A 2.8B parameter pretrained language model, pretrained on RedPajama-Data-1T, together with an Instruction-tuned Version and a Chat Version.
Lightning-AI	Lit-LLaMA	✖	Apache-2.0	-	Independent implementation of LLaMA that is fully open source under the Apache 2.0 license.
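
The "20 tokens per model parameter" rule quoted for Cerebras-GPT can be turned into a quick back-of-the-envelope estimate. The following is a minimal sketch, assuming that rule of thumb exactly as stated in the table; the function name and the example model sizes are illustrative and not taken from any listed project.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20.0):
    """Training tokens suggested by the ~20 tokens/parameter rule of thumb."""
    return n_params * tokens_per_param


# Illustrative model sizes (assumed for the example, not tied to a specific release).
for label, n_params in [("1B", 1e9), ("7B", 7e9), ("13B", 13e9)]:
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{label:>4} params -> ~{tokens / 1e9:.0f}B training tokens")
```

Under this rule, the token budget grows linearly with model size, so doubling the parameter count doubles the suggested amount of training data.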

Domain Models

contributor	model	domain	language	base model	main feature
UT Southwestern/ UIUC/OSU/HDU	ChatDoctor	medical	en	LLaMA	Maybe the first domain-specific chat model tuned on LLaMA.
Cambridge	Visual Med-Alpaca	biomedical	en	LLaMA-7B	a multi-modal foundation model designed specifically for the biomedical domain.
HIT	Huatuo / ChatGLM-Med	medical	zh	LLaMA/ChatGLM	fine-tuned with a Chinese medical knowledge dataset generated using the GPT-3.5 API.
ShanghaiTech, etc	DoctorGLM	medical	en/zh	ChatGLM-6B	Chinese medical consultation model fine-tuned on ChatGLM-6B.
THU AIR	BioMedGPT-1.6B	biomedical	en/zh	-	a pre-trained multi-modal molecular foundation model with 1.6B parameters that associates 2D molecular graphs with texts.
@LiuHC0428	LawGPT_zh	legal	zh	ChatGLM-6B	a general model in the Chinese legal domain, trained on data generated via Reliable-Self-Instruction.
SJTU	MedicalGPT-zh	medical	zh	ChatGLM-6B	a general model in the Chinese medical domain, trained on diverse data generated via self-instruct.
SJTU	PMC-LLaMA	medical	zh	LLaMA	Continue Training LLaMA on Medical Papers.
HuggingFace	StarCoder	code generation	en	-	a language model (LM) trained on source code and natural language text. Its training data incorporates more than 80 different programming languages as well as text extracted from GitHub issues and commits and from notebooks.
@CogStack	NHS-LLM	medical	en	not clear	A conversational model for healthcare trained using OpenGPT.
@pengxiao-song	LaWGPT	legal	zh	LLaMA/ChatGLM	expand the vocab with Chinese legal terminologies, instruction fine-tuned on data generated using self-instruct.

General Domain Instruction Models

contributor	model/project	language	base model	main feature
Stanford	Alpaca	en	LLaMA/OPT	uses 52K instruction-following examples generated with the Self-Instruct technique to fine-tune 7B LLaMA. The resulting model, Alpaca, behaves similarly to the `text-davinci-003` model on the Self-Instruct instruction-following evaluation suite. Alpaca has inspired many follow-up models.
LianJiaTech	BELLE	en/zh	BLOOMZ-7B1-mt	maybe the first Chinese model to follow Alpaca.
Databricks	Dolly	en	GPT-J 6B	use Alpaca data to fine-tune a 2-year-old model: GPT-J, which exhibits surprisingly high quality instruction following behavior not characteristic of the foundation model on which it is based.
@tloen	Alpaca-LoRA	en	LLaMA-7B	trained within hours on a single RTX 4090, reproducing the Stanford Alpaca results using low-rank adaptation (LoRA), and can run on a Raspberry Pi.
ColossalAI	Coati7B	en/zh	LLaMA-7B	a large language model developed by the ColossalChat project
Shanghai AI Lab	LLaMA-Adapter	en	LLaMA-7B	Fine-tuning LLaMA to follow instructions within 1 Hour and 1.2M Parameters
AetherCortex	Llama-X	en	LLaMA	Open Academic Research on Improving LLaMA to SOTA LLM.
TogetherComputer	OpenChatKit	en	GPT-NeoX-20B	OpenChatKit provides a powerful, open-source base to create both specialized and general purpose chatbots for various applications. The kit includes an instruction-tuned language model, a moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories.
nomic-ai	GPT4All	en	LLaMA	trained on a massive collection of clean assistant data including code, stories and dialogue
@ymcui	Chinese-LLaMA-Alpaca	en/zh	LLaMA-7B/13B	expand the Chinese vocabulary based on the original LLaMA and use Chinese data for secondary pre-training, further enhancing Chinese basic semantic understanding. Additionally, the project uses Chinese instruction data for fine-tuning on the basis of the Chinese LLaMA, significantly improving the model's understanding and execution of instructions.
UC Berkeley/Stanford/CMU	Vicuna	en	LLaMA-13B	Impressing GPT-4 with 90% ChatGPT Quality.
UCSD/SYSU	baize	en/zh	LLaMA	fine-tuned with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself. Alpaca's data is also used to improve its performance.
UC Berkeley	Koala	en	LLaMA	Rather than maximizing quantity by scraping as much web data as possible, the team focuses on collecting a small high-quality dataset.
@imClumsyPanda	langchain-ChatGLM	en/zh	ChatGLM-6B	local knowledge based ChatGLM with langchain.
@yangjianxin1	Firefly	zh	bloom-1b4-zh bloom-2b6-zh	Instruction Tuning on Chinese dataset. Vocabulary pruning, ZeRO, and tensor parallelism are used to effectively reduce memory consumption and improve training efficiency.
microsoft	GPT-4-LLM	en/zh	LLaMA	aims to share data generated by GPT-4 for building instruction-following LLMs with supervised learning and reinforcement learning.
Hugging Face	StackLLaMA	en	LLaMA	trained on StackExchange data; the main goal is to serve as a tutorial and walkthrough on how to train a model with RLHF, not primarily model performance.
Nebuly	ChatLLaMA	en	-	a library that allows you to create hyper-personalized ChatGPT-like assistants using your own data and the least amount of compute possible.
@juncongmoo	ChatLLaMA	en	LLaMA	LLaMA-based RLHF model, runnable in a single GPU.
@juncongmoo	minichatgpt	en	GPT/OPT ...	To Train ChatGPT In 5 Minutes with ColossalAI.
@LC1332	Luotuo-Chinese-LLM	zh	LLaMA/ChatGLM	Instruction fine-tuned Chinese Language Models, with colab provided!
@Facico	Chinese-Vicuna	zh	LLaMA	A Chinese Instruction-following LLaMA-based Model, fine-tuned with Lora, cpp inference supported, colab provided.
@yanqiangmiffy	InstructGLM	en/zh	ChatGLM-6B	ChatGLM based instruction-following model, fine-tuned on a variety of data sources, supports deepspeed accelerating and LoRA.
alibaba	Wombat	en	LLaMA	proposes RRHF, a novel learning paradigm and an alternative to RLHF, which scores responses generated by different sampling policies and learns to align them with human preferences through a ranking loss. Its performance is comparable to RLHF, with fewer models used in the process.
@WuJunde	alpaca-glassoff	en	LLaMA	a mini chat AI that accepts images and can run on your own laptop, based on stanford-alpaca and alpaca-lora.
@JosephusCheung	Guanaco	multi	LLaMA-7B	A Multilingual Instruction-Following Language Model.
BlinkDL	ChatRWKV	en/zh	RWKV-LM	powered by RWKV (100% RNN), Training sponsored by Stability EleutherAI.
@FreedomIntelligence	LLM Zoo	multi	BLOOMZ/LLaMA	a project that provides data, models, and evaluation benchmark for large language models. model released: Phoenix, Chimera
SZU	Linly	en/zh	LLaMA	expand the Chinese vocabulary, full fine-tuned models, largest LLaMA-based Chinese models, aggregation of Chinese instruction data, reproducible details.
@lamini-ai	lamini	multi	-	data generator for generating instructions to train instruction-following LLMs.
Stability-AI	StableVicuna	en	LLaMA	a further instruction fine tuned and RLHF trained version of Vicuna v0 13b, with better performance than Vicuna.
Hugging Face	HuggingChat	en	LLaMA	seems to be the first one available to access as a platform that appears similar to ChatGPT.
microsoft	WizardLM	en	LLaMA-7B	trained with 70k evolved instructions. Evol-Instruct is a novel method that uses LLMs instead of humans to automatically mass-produce open-domain instructions of various difficulty levels and skill ranges, to improve the performance of LLMs.
FDU	OpenChineseLLaMA	en/zh	LLaMA-7B	further pretrain LLaMA on Chinese data, improving LLaMA performance on Chinese tasks.
@chenfeng357	open-Chinese-ChatLLaMA	en/zh	LLaMA	The complete training code of the open-source Chinese-Llama model, covering the full process from pre-training to instruction tuning and RLHF.
@FSoft-AI4Code	CodeCapybara	en	LLaMA	Open Source LLaMA Model that Follow Instruction-Tuning for Code Generation.
@mbzuai-nlp	LaMini-LM	en	LLaMA/Flan-T5 ...	A Diverse Herd of Distilled Models from Large-Scale Instructions.
NTU	Panda	en/zh	LLaMA	further pretraining on Chinese data, full-size of LLaMA models.
@hiyouga	ChatGLM-Efficient-Tuning	en/zh	ChatGLM-6B	efficient fine-tuning ChatGLM-6B with PEFT.
IBM/CMU/MIT	Dromedary	en	LLaMA-65B	Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision.
@melodysdreamj	WizardVicunaLM	multi	Vicuna	Wizard's dataset + ChatGPT's conversation extension + Vicuna's tuning method, achieving approximately 7% performance improvement over Vicuna.

Multi-Modal

contributor	project	language	base model	main feature
BaihaiAI	IDPChat	en/zh	LLaMA-13B Stable Diffusion	Open Chinese multi-modal model, single GPU runnable, easy to deploy, UI provided.
KAUST	MiniGPT-4	en/zh	LLaMA	MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer, and yields many emerging vision-language capabilities similar to those demonstrated in GPT-4.
UW–Madison/MSR /Columbia University	LLaVA	en	LLaMA	visual instruction tuning is proposed, towards building large language and vision models with GPT-4 level capabilities.
NUS/THU	VPGTrans	en	LLaMA/OPT/ Flan-T5/BLIP-2 ...	transferring VPG across LLMs to build VL-LLMs at significantly lower cost. The GPU hours can be reduced by more than 10 times and the training data can be reduced to around 10%. Two novel VL-LLMs are released via VPGTrans, including VL-LLaMA and VL-Vicuna. VL-LLaMA is a multimodal version of LLaMA, built by transferring the BLIP-2 OPT-6.7B to LLaMA via VPGTrans. VL-Vicuna is a GPT-4-like multimodal chatbot, based on the Vicuna LLM.
CAS, etc	X-LLM	en/zh	ChatGLM-6B	X-LLM converts multi-modalities (images, speech, videos) into foreign languages using X2L interfaces and feeds them into a large language model (ChatGLM) to build a multimodal LLM, achieving impressive multimodal chat capabilities.
NTU	Otter	en	OpenFlamingo	a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning. Furthermore, it optimizes OpenFlamingo's implementation, democratizing the required training resources from 1x A100 GPU to 4x RTX-3090 GPUs.

Data

Pretrain Data

contributor	data	language	main feature
TogetherComputer	RedPajama-Data	en	An Open Source Recipe to Reproduce LLaMA training dataset.

Instruction Data

see Alpaca-CoT data collection

Synthetic Data Generation

contributor	method	main feature
UW, etc.	self-instruct	using the model's own generations to create a large collection of instructional data.
@LiuHC0428	Reliable-Self-Instruction	use ChatGPT to generate some questions and answers based on a given text.
PKU	Evol-Instruct	a novel method, proposed in WizardLM, that uses LLMs instead of humans to automatically mass-produce open-domain instructions of various difficulty levels and skill ranges, to improve the performance of LLMs.
KAUST, etc.	CAMEL	a novel communicative agent framework named role-playing is proposed, which involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. Role-playing can be used to generate conversational data in a specific task/domain.
@chatarena	ChatArena	a library that provides multi-agent language game environments and facilitates research about autonomous LLM agents and their social interactions. it provides a flexible framework to define multiple players, environments and the interactions between them, based on Markov Decision Process.

Evaluation

contributor	method	main feature
-	human evaluation	-
OpenAI	GPT-4/ChatGPT	-
PKU/CMU/MSRA ...	PandaLM	Reproducible and Automated Language Model Assessment.
UCB	Chatbot Arena	Chat with two anonymous models side-by-side and vote for which one is better, then use the Elo rating system to calculate the relative performance of the models.
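
The Elo-based ranking described for Chatbot Arena can be illustrated with the standard Elo update. This is a minimal sketch under stated assumptions: the K-factor of 32, the starting rating of 1000, and the model names are hypothetical choices for the example, and the parameters actually used by Chatbot Arena may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b); score_a is 1 (win), 0.5 (tie) or 0 (loss)."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


def rate(battles, initial=1000.0, k=32.0):
    """Process (model_a, model_b, score_a) votes in order and return final ratings."""
    ratings = {}
    for model_a, model_b, score_a in battles:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(r_a, r_b, score_a, k)
    return ratings


# Hypothetical votes: (left model, right model, score of the left model).
battles = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]
ratings = rate(battles)
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update depends on the current ratings, the result of this online variant depends on the order in which votes are processed.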

Framework/ToolKit/Platform

contributor	project	main feature
CAS	Alpaca-CoT	extend CoT data to Alpaca to boost its reasoning ability. aims at building an instruction finetuning (IFT) platform with extensive instruction collection (especially the CoT datasets) and a unified interface for various large language models.
ColossalAI	ColossalChat	An open-source low cost solution for cloning ChatGPT with a complete RLHF pipeline.
microsoft	deepspeed-chat	Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.
LAION-AI	Open Assistant	a project meant to give everyone access to a great chat based large language model.
HKUST	LMFlow	an extensible, convenient, and efficient toolbox for finetuning large machine learning models, designed to be user-friendly, speedy and reliable, and accessible to the entire community.
UCB	EasyLM	EasyLM is a one stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax. EasyLM can scale up LLM training to hundreds of TPU/GPU accelerators by leveraging JAX's pjit functionality.
@CogStack	OpenGPT	A framework for creating grounded instruction based datasets and training conversational domain expert Large Language Models (LLMs).
HugAILab	HugNLP	a unified and comprehensive NLP library based on HuggingFace Transformer.

Alignment

contributor	method	used in	main feature
-	IFT	ChatGPT	Instruction Fine-Tuning.
-	RLHF	ChatGPT	RL from Human Feedback.
Anthropic	RLAIF	Claude	RL from AI Feedback.
alibaba	RRHF	Wombat	a novel learning paradigm proposed as an alternative to RLHF, which scores responses generated by different sampling policies and learns to align them with human preferences through a ranking loss. Its performance is comparable to RLHF, with fewer models used in the process.
HKUST	RAFT	-	RAFT is a new alignment algorithm, which is more efficient than conventional (PPO-based) RLHF.
IBM/CMU/MIT	SELF-ALIGN	Dromedary	combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
PKU	CVA	Beaver	Constrained Value Alignment via Safe RLHF.
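
The ranking loss mentioned for RRHF can be sketched in a few lines. This is a minimal illustration, assuming each candidate response is scored by the length-normalized sum of its token log-probabilities, and that the objective combines a pairwise hinge ranking term with a fine-tuning term on the highest-reward response. The function name and the toy numbers are hypothetical, and this is not the Wombat training code.

```python
def rrhf_loss(token_logprobs, rewards):
    """Return (ranking_loss, finetune_loss) for one prompt.

    token_logprobs: one list of per-token log-probabilities per candidate response.
    rewards: one scalar preference score per candidate response.
    """
    # Model score of each response: length-normalized log-probability.
    scores = [sum(lp) / len(lp) for lp in token_logprobs]

    # Penalize every pair where a lower-reward response gets a higher model score.
    ranking_loss = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if rewards[i] < rewards[j]:
                ranking_loss += max(0.0, scores[i] - scores[j])

    # Fine-tuning term: negative log-likelihood of the best-rewarded response.
    best = max(range(n), key=lambda idx: rewards[idx])
    finetune_loss = -sum(token_logprobs[best])
    return ranking_loss, finetune_loss


# Toy example: the model prefers response 0, but response 1 has the higher reward.
logprobs = [[-1.0, -1.0], [-2.0, -2.0]]
print(rrhf_loss(logprobs, rewards=[0.0, 1.0]))
```

When the model's scores already agree with the reward ordering, the ranking term is zero and only the fine-tuning term remains.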

Multi-Language

vocabulary expansion

According to the official FAQ in the LLaMA repo, the vocabulary contains few tokens for non-Latin languages, so one line of effort is to expand the vocabulary. Some of these works are listed below:

contributor	model/project	language	base model	main feature
@ymcui	Chinese-LLaMA-Alpaca	zh	LLaMA
SZU	Linly	en/zh	LLaMA	full-size LLaMA, further pretrained on Chinese corpus.
@Neutralzz	BiLLa	en/zh	LLaMA-7B	further pretrained on Wudao, PILE, WMT.
@pengxiao-song	LaWGPT	zh	LLaMA/ChatGLM	expand the vocab with Chinese legal terminologies, instruction fine-tuned on data generated using self-instruct.
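
Vocabulary expansion involves two coupled changes: new token ids and new rows in the embedding matrix. The following toy sketch shows that coupling with plain Python lists. The mean-of-existing-embeddings initialization is an assumed heuristic chosen for illustration, the function name is hypothetical, and none of this is taken from the projects listed above, which also continue pretraining after expanding the vocabulary.

```python
def expand_vocab(vocab, embeddings, new_tokens):
    """Return a copy of (vocab, embeddings) extended with unseen tokens.

    vocab: dict mapping token -> row index in embeddings.
    embeddings: list of equal-length float lists, one row per token.
    """
    dim = len(embeddings[0])
    # Assumed heuristic: start new rows at the mean of the existing embeddings.
    mean_row = [sum(row[d] for row in embeddings) / len(embeddings) for d in range(dim)]

    new_vocab = dict(vocab)
    new_embeddings = [list(row) for row in embeddings]
    for token in new_tokens:
        if token in new_vocab:
            continue  # already covered, keep the trained embedding
        new_vocab[token] = len(new_embeddings)
        new_embeddings.append(list(mean_row))
    return new_vocab, new_embeddings


vocab = {"<unk>": 0, "hello": 1, "world": 2}
embeddings = [[0.0, 0.0], [1.0, 3.0], [2.0, 6.0]]
new_vocab, new_embeddings = expand_vocab(vocab, embeddings, ["你好", "world", "世界"])
print(new_vocab)
print(len(new_embeddings), "rows")
```

The new rows carry no language-specific knowledge at this point, which is why the listed projects follow the expansion with further pretraining on Chinese data.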

Efficient Training/Fine-Tuning

contributor	method	main feature
microsoft	LoRA	Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
stanford	Prefix Tuning	a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, which we call the prefix.
THU	P-Tuning	P-tuning leverages few continuous free parameters to serve as prompts fed as the input to the pre-trained language models. We then optimize the continuous prompts using gradient descent as an alternative to discrete prompt searching.
THU/BAAI/ Shanghai Qi Zhi Institute	P-Tuning v2	a novel empirical finding that properly optimized prompt tuning can be comparable to fine-tuning universally across various model scales and NLU tasks. Technically, P-tuning v2 is not conceptually novel. It can be viewed as an optimized and adapted implementation of Deep Prompt Tuning.
Google	Prompt Tuning	a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Prompt Tuning can be seen as a simplification of "prefix tuning".
GT/Princeton/microsoft	AdaLoRA	adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition.
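
The LoRA idea from the table (frozen weights plus trainable rank decomposition matrices) can be sketched without any framework. This is a minimal illustration in pure Python, assuming the usual formulation y = Wx + (alpha / r) * B(Ax) with B initialized to zero. The class name, the default values, and the tiny dimensions are hypothetical, and this is not the PEFT implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Linear layer with a frozen weight and a trainable low-rank update."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        # A starts random, B starts at zero, so the update B @ A is zero at first.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_params(self):
        return sum(len(row) for row in self.a) + sum(len(row) for row in self.b)


d = 8
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
layer = LoRALinear(identity, rank=2)
x = [float(i) for i in range(d)]
print(layer.forward(x))  # identical to the frozen layer before any training
print(layer.trainable_params(), "trainable vs", d * d, "frozen parameters")
```

The trainable parameter count is r * (d_in + d_out) instead of d_in * d_out, which is where the savings come from when r is much smaller than the layer dimensions.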

acknowledgement: HuggingFace Peft

Low-Cost Inference

contributor	project	main feature
@ggerganov	llama.cpp	C/C++ implementation of LLaMA and some other models, using quantization.
@NouamaneTazi	bloomz.cpp	C++ implementation for BLOOM inference.
@mlc-ai	MLC LLM	a universal solution that allows any language models to be deployed natively on a diverse set of hardware backends and native applications, plus a productive framework for everyone to further optimize model performance for their own use cases.
alibaba	ChatGLM-MNN	converts the ChatGLM-6B model to MNN and performs inference using C++.
Jittor	JittorLLMs	Significantly reduce hardware costs (by 80%), currently known as the lowest-cost deployment library, supports multiple platforms.
OpenBMB	BMInf	BMInf supports running models with more than 10 billion parameters on a single NVIDIA GTX 1060 GPU in its minimum requirements. In cases where the GPU memory supports the large model inference (such as V100 or A100), BMInf still has a significant performance improvement over the existing PyTorch implementation.
hpcaitech	EnergonAI	With tensor parallel operations, pipeline parallel wrapper, distributed checkpoint loading, and customized CUDA kernel, EnergonAI can enable efficient parallel inference for large-scale models.
MegEngine	InferLLM	a lightweight LLM model inference framework that mainly references and borrows from the llama.cpp project. llama.cpp puts almost all core code and kernels in a single file and uses a large number of macros, making it difficult for developers to read and modify.
@saharNooby	rwkv.cpp	a port of BlinkDL/RWKV-LM to ggerganov/ggml.
FMInference	FlexGen	FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows high-throughput generation by IO-efficient offloading, compression, and large effective batch sizes.
huggingface bigcode-project	starcoder.cpp	C++ implementation for 💫 StarCoder inference using the ggml library.
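
Several of the projects above rely on weight quantization to cut memory use. The following is a generic sketch of block-wise symmetric quantization, assuming 4-bit signed integers and one floating-point scale per block. It is meant only to show the principle; the function names are hypothetical and the layout is not the actual storage format of llama.cpp or ggml.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to small signed integers plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric quantization
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from the integers and the block scale."""
    return [scale * q for q in quantized]


weights = [0.7, -0.31, 0.02, 0.12, -0.68, 0.4, 0.0, 0.25]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("scale:", scale)
print("quantized:", q)
print("max abs error:", max_error)
```

Each weight is stored in 4 bits instead of 16 or 32, at the cost of a rounding error of at most about half the block scale.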

Safety

contributor	method	main feature
thu-coai	Safety-Prompts	Chinese safety prompts for evaluating and improving the safety of LLMs.

Input Length Extrapolation

contributor	method	main feature
UW, etc.	ALiBi	Instead of adding position embeddings at the bottom of the transformer stack, ALiBi adds a linear bias to each attention score, allowing the model to be trained on, for example, 1024 tokens, and then do inference on 2048 (or much more) tokens without any finetuning.
DeepPavlov, etc.	RMT	use a recurrent memory to extend the context length.
bytedance	SCM	unleash infinite-length input capacity for large-scale language models.
BlinkDL	RWKV-LM	pure RNN.
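
The ALiBi row describes adding a linear bias to each attention score instead of using position embeddings. A minimal sketch of that bias follows, assuming the geometric slope schedule 2^(-8/n), 2^(-16/n), ... for n heads with n a power of two, and a causal mask folded into the same matrix. The function names are hypothetical.

```python
def alibi_slopes(n_heads):
    """Per-head slopes: a geometric sequence starting at 2 ** (-8 / n_heads)."""
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles a power-of-two number of heads")
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (head + 1) for head in range(n_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to attention scores: -slope * distance, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
print(slopes)
for row in alibi_bias(4, slopes[0]):
    print(row)
```

The bias depends only on the distance between query and key, so the same function can produce a matrix for a sequence length longer than any seen in training, which is what allows the extrapolation described in the table.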
