LLaMA

This is a variant of the LLaMA model and has the following changes:

Compression: 8-bit model quantization using bitsandbytes.
Non-Model Parallel (MP): run the 13B model on a single GPU. All MP code removed.
Extended model:
- Fixed sampler: a better sampler that improves generation quality, with temperature, top_p, repetition_penalty, and tail_free controls.
- (Future): provide more controls for generation and expose the repetition penalty so that the CLI can pass in the options.
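
To make the sampler controls concrete, here is a minimal pure-Python sketch of how temperature, top_p, and a repetition penalty can be combined when picking the next token. It is not the code in this repository: the function name `sample_next_token`, the default `repetition_penalty=1.1`, and the penalty convention (divide positive logits, multiply negative ones) are assumptions made for illustration, and tail-free sampling is left out.

```python
import math
import random


def sample_next_token(logits, generated, temperature=0.7, top_p=0.85,
                      repetition_penalty=1.1, rng=random):
    """Pick the next token id from raw logits (illustrative sketch only)."""
    logits = list(logits)

    # Repetition penalty: make tokens that were already generated less likely.
    # (One common convention; the fork's sampler may differ.)
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus): keep the smallest set of tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample from the kept tokens; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Lowering top_p shrinks the candidate set, so a very peaked distribution with a small top_p behaves like greedy decoding.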

And more soon. I'm experimenting with compression and acceleration techniques to make the models:

smaller and faster
run on low-resource hardware

I'm also building a LLaMA-based ChatGPT.

Hardware

ChattyLLaMA

ChattyLLaMA is an experimental LLaMA-based ChatGPT.

Documentation

All the new code is available in the chattyllama directory.

Combined

All changes and fixes baked into one:

Non-Model Parallel (MP): all MP constructs removed (MP shards weights across a GPU cluster setup)
8-bit quantized model using bitsandbytes
Sampler fixes, better sampler
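
To give a feel for what 8-bit quantization means, the sketch below applies simple absmax quantization to a row of weights in pure Python. This is a deliberately simplified illustration, not what bitsandbytes runs internally (its int8 kernels are more involved); the helper names `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(row):
    """Map a row of floats to int8-range integers plus one float scale."""
    absmax = max(abs(x) for x in row)
    # Guard against an all-zero row, where any scale works.
    scale = absmax / 127 if absmax > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantized row."""
    return [v * scale for v in q]


weights = [0.12, -0.50, 0.33, 0.02]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

# Each restored value is within half a quantization step of the original.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of two (fp16), at the cost of a small rounding error bounded by half a quantization step.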

Source files location:

chattyllama/combined/model.py: a fork of the LLaMA model.
chattyllama/combined/inference.py: run model inference (it's a modified copy of example.py).

Non-MP/single GPU

Source files location:

chattyllama/model.py: a fork of the LLaMA model.
chattyllama/inference.py: run model inference.

Code Examples

Code walkthrough: notebooks.

This shows how to get it running on a single A100 40GB GPU. The code is outdated, though: it uses the original model version from Meta AI.
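
As a rough back-of-the-envelope check of why a single 40GB GPU can hold the 13B model, the sketch below estimates the memory taken by the weights alone. The nominal parameter count of 13e9 is an assumption, and the estimate ignores activations, the KV cache, and framework overhead.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS_13B = 13e9  # nominal parameter count (assumption)

fp16_gib = weight_memory_gib(N_PARAMS_13B, 2)  # 2 bytes per fp16 weight
int8_gib = weight_memory_gib(N_PARAMS_13B, 1)  # 1 byte per int8 weight

print(f"fp16: {fp16_gib:.1f} GiB, int8: {int8_gib:.1f} GiB")
```

Under these assumptions the fp16 weights need roughly 24 GiB and the int8 weights about half of that.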

For the bleeding-edge version, follow the quick start below.

Quick start

Download model weights into ./model.
Install all the needed dependencies.

$ git clone https://github.com/cedrickchee/llama.git
$ cd llama && pip install -r requirements.txt

Note:

Don't use Conda. Use pip.
If you have trouble with bitsandbytes, build and install it from source.

$ pip install -e .
#torchrun --nproc_per_node 1 example.py --ckpt_dir ../7B --tokenizer_path ../tokenizer.model
$ cd chattyllama/combined

Modify inference.py with the path to your weights directory:

# ...

if __name__ == "__main__":
    main(
        ckpt_dir="/model/vi/13B", # <-- change the path
        tokenizer_path="/model/vi/tokenizer.model", # <-- change the path
        temperature=0.7,
        top_p=0.85,
        max_seq_len=1024,
        max_batch_size=1
    )

Modify inference.py with your prompt:

def main(...):
    # ...

    prompts = [
        "I believe the meaning of life is"
    ]

    # ...

Run inference:

$ python inference.py

LLaMA compatible port

Looking to use the LLaMA model with the HuggingFace library? Take a look at my "transformers-llama" repo.

Other ports

HuggingFace Transformers LLaMA model
Text generation web UI - A Gradio Web UI for running Large Language Models like LLaMA, GPT-Neo, OPT, and friends. My guide: "Installing 8/4-bit LLaMA with text-generation-webui on Linux"
LLaMa CPU fork - We need more work like this that lowers the compute requirements. Really underappreciated.
LLaMA Jax
Running LLaMA 7B on a 64GB M2 MacBook Pro with llama.cpp by Simon Willison - llama.cpp is from the same Whisper.cpp hacker, ggerganov. Never disappointed by ggerganov's work.

It's genuinely possible to run an LLM that's hinting towards the performance of GPT-3 on your own hardware now. I thought that was still a few years away.

At this rate of progress in model compression and acceleration, we should soon be able to run LLM inference locally on mobile devices. QNNPACK, a hardware-optimized library that also supports mobile processors, can help. JIT compilers like OpenXLA/PyTorch Glow can optimize the computation graph so the model runs well on low-resource hardware.

We underestimated pre-trained language models (~2019) and overestimated a lot of things.

A quick tutorial by me: 4 Steps in Running LLaMA-7B on a M1 MacBook with llama.cpp

My llama.cpp patches for Linux support. (WIP)
Dalai - The simplest way to run LLaMA on your personal computer. It automatically installs and runs LLaMA on your computer. Powered by llama.cpp and Shawn's llama-dl CDN.
Stanford Alpaca: An Open-Source Instruction-Following LLaMA Model
- Alpaca-LoRA - Fine-tuning and training code for LLaMA to replicate the Alpaca instruct-tuned model on consumer hardware, while waiting for Stanford to release their code.
- Alpaca.cpp - Locally run an instruction-tuned chat-style LLM. This combines the LLaMA foundation model (llama.cpp) with an open reproduction (Alpaca-LoRA) of Stanford Alpaca, a fine-tuning of the base model to obey instructions (akin to the RLHF used to train ChatGPT).
- Alpaca Native model weights - The model was fine-tuned using the original repository: https://github.com/tatsu-lab/stanford_alpaca (no LoRA has been used). Examples:
  - HuggingFace Transformers inference (code and quick start guide).
  - Creating a chatbot using Alpaca native and LangChain

Minimal LLaMA - Jason's HuggingFace Transformers port using OPT code internally. This version should be more stable. But the code is not well-tested yet. Bonus: you can quickly see how well the model can be fine-tuned either using HuggingFace PEFT with 8-bit or Pipeline Parallelism.
pyllama - Run LLM in a single GPU, as simple as pip install pyllama. It's a quick & dirty hacked version of 🦙 LLaMA. Bonus: comes with a way to start a Gradio Web UI for trying out prompting in browser. Good tips: "To load KV cache in CPU, run export KV_CAHCHE_IN_GPU=0 in the shell.".
minichatgpt - Train ChatGPT in minutes with ColossalAI (blog post) (minichatgpt training process is pending my verification. I can confirm the code there was based on ColossalAI's mini demo. It doesn't support LLaMA yet.)
- Supports LoRA
- Supports RL paradigms, like reward model, PPO
- Datasets used for training:
  - Train with prompt data from: fka/awesome-minichatgpt-prompts. Training scripts and instructions here.
  - Train the reward model using Dahoas/rm-static dataset.

Supporting tools

Resharding and HuggingFace conversion - Useful scripts for transforming the weights, if for some reason you still want to spread the weights and run the larger model (in fp16 instead of int8) across multiple GPUs.

Plan

TODO:

Priority: high

Improve sampler - refer to shawwn/llama fork.
Fine-tune the models on a diverse set of instruction datasets from LAION's OpenAssistant. Check out my ChatGPT notes for larger training data. (blocked by dataset v1)
Try the fine-tuning protocol from Flan.
- LLaMA paper touches on finetuning briefly, referencing that.
Fine-tune model with HF's PEFT and Accelerate. PEFT doesn't support causal LM like LLaMA yet (blocked by PR)

Priority: low

Start and try other fine-tuning ideas:
- ChatGPT-like = LLaMA + CarperAI's trlX (RLHF) library + Anthropic's public preference dataset. I don't know how feasible this is if the experiments are larger scale (compute-wise) and use RL models that are good at instruction following.

Reminder-to-self:

People under-appreciate fine-tuning alone compared to RLHF. RL algorithms (unsupervised) are quite finicky compared to supervised deep learning. RL is hard-ish.

Original README

This repository is intended as a minimal, hackable and readable example to load LLaMA (arXiv) models and run inference. In order to download the checkpoints and tokenizer, fill out this Google form.

Setup

In a conda env with pytorch / cuda available, run:

pip install -r requirements.txt

Then in this repository:

pip install -e .

Download

Once your request is approved, you will receive links to download the tokenizer and model files. Edit the download.sh script with the signed url provided in the email to download the model weights and tokenizer.

Inference

The provided example.py can be run on a single or multi-gpu node with torchrun and will output completions for two pre-defined prompts. Using TARGET_FOLDER as defined in download.sh:

torchrun --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model

Different models require different MP values:

Model	MP
7B	1
13B	2
33B	4
65B	8
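
The table above can be turned into a small helper that builds the torchrun command for a given model size. This is only a sketch for the original (model-parallel) example.py: the names `MP_BY_MODEL` and `torchrun_command` are hypothetical, and `/path/to/weights` stands in for your own TARGET_FOLDER.

```python
import shlex

# MP value (number of processes/GPUs) per model size, copied from the table.
MP_BY_MODEL = {"7B": 1, "13B": 2, "33B": 4, "65B": 8}


def torchrun_command(model_size, target_folder):
    """Return the torchrun argument list for the given model size."""
    mp = MP_BY_MODEL[model_size]  # raises KeyError for unknown sizes
    return [
        "torchrun", "--nproc_per_node", str(mp), "example.py",
        "--ckpt_dir", f"{target_folder}/{model_size}",
        "--tokenizer_path", f"{target_folder}/tokenizer.model",
    ]


# Print a copy-pasteable command line for the 13B model.
print(shlex.join(torchrun_command("13B", "/path/to/weights")))
```

The returned list can also be passed directly to `subprocess.run` instead of being printed.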

FAQ

Reference

LLaMA: Open and Efficient Foundation Language Models -- https://arxiv.org/abs/2302.13971

@article{touvron2023llama,
  title={LLaMA: Open and Efficient Foundation Language Models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}

Model Card

See MODEL_CARD.md

License

See the LICENSE file.
