cedrickchee / llama

Inference code for LLaMA 2 models

LLaMA

This is a variant of the LLaMA model and has the following changes:

  • Compression: 8-bit model quantization using bitsandbytes.
  • Non-Model Parallel (MP): run the 13B model on a single GPU. All MP code removed.
  • Extended model:
    • Fix the sampler: a better sampler that improves generation quality, with temperature, top_p, repetition_penalty, and tail_free controls.
    • (Future): provide more control over generation; expose the repetition penalty so that the CLI can pass in the options.
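
To make the sampler controls above concrete, here is a minimal pure-Python sketch of how temperature, top_p, and a repetition penalty are commonly combined. It is not the code from this repo: the function name `sample_next_token`, the default `repetition_penalty=1.1`, and the divide-or-multiply penalty rule are assumptions for illustration, and tail-free sampling is left out.

```python
import math
import random


def sample_next_token(logits, prev_tokens, temperature=0.7, top_p=0.85,
                      repetition_penalty=1.1, rng=random):
    """Pick one token id from raw logits (a plain list of floats).

    Hypothetical helper for illustration; temperature must be > 0.
    """
    logits = list(logits)

    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(prev_tokens):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus): keep the smallest set of tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample from the nucleus; choices() renormalizes the weights.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

Passing `rng=random.Random(0)` makes a run reproducible. A real implementation would operate on tensors rather than Python lists.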

And more soon. I'm experimenting with compression and acceleration techniques to make the models:

  • smaller and faster
  • run on low-resource hardware

I'm also building a LLaMA-based ChatGPT.

Hardware

ChattyLLaMA

ChattyLLaMA is an experimental LLaMA-based ChatGPT.

Documentation

All the new code is available in the chattyllama directory.

Combined

All changes and fixes baked into one:

  • Non-Model Parallel (MP): all MP constructs removed (MP shards weights across a GPU cluster setup)
  • 8-bit quantized model using bitsandbytes
  • Sampler fixes, better sampler
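
As a rough illustration of what 8-bit quantization does to a weight row, here is a minimal absmax sketch in pure Python. It is not how bitsandbytes is implemented: the library's int8 path is more elaborate (for example, it treats outlier values separately), and the helper names `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(row):
    """Map a list of floats to int8-range integers plus one float scale.

    Illustrative only; not the bitsandbytes algorithm.
    """
    absmax = max(abs(x) for x in row)
    # Avoid dividing by zero for an all-zero row.
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the quantized values."""
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_absmax(row)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`). That small error is the trade made for storing one byte per weight instead of two.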

Source files location:

  • chattyllama/combined/model.py: a fork of the LLaMA model.
  • chattyllama/combined/inference.py: runs model inference (a modified copy of example.py).

Non-MP/single GPU

Source files location:

  • chattyllama/model.py: a fork of the LLaMA model.
  • chattyllama/inference.py: runs model inference.

Code Examples

Code walkthrough: notebooks.

This shows how to get it running on a single A100 40GB GPU. The code is outdated, though: it uses the original model version from Meta AI.

For bleeding-edge things, follow the quick start below.

Quick start

  1. Download model weights into ./model.

  2. Install all the needed dependencies.

$ git clone https://github.com/cedrickchee/llama.git
$ cd llama && pip install -r requirements.txt

Note:

$ pip install -e .
#torchrun --nproc_per_node 1 example.py --ckpt_dir ../7B --tokenizer_path ../tokenizer.model
$ cd chattyllama/combined

  3. Modify inference.py with the path to your weights directory:

# ...

if __name__ == "__main__":
    main(
        ckpt_dir="/model/vi/13B", # <-- change the path
        tokenizer_path="/model/vi/tokenizer.model", # <-- change the path
        temperature=0.7,
        top_p=0.85,
        max_seq_len=1024,
        max_batch_size=1
    )

  4. Modify inference.py with your prompt:

def main(...):
    # ...

    prompts = [
        "I believe the meaning of life is"
    ]

    # ...

  5. Run inference:

$ python inference.py
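
Editing inference.py for every run gets tedious, and exposing options to the CLI is listed above as a future goal. The following is one possible sketch using the standard argparse module, not something the repo currently provides. The flag names mirror the keyword arguments shown above, and `parse_args` and `--prompt` are hypothetical. `main` would also need to be changed to accept a prompt.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for a hypothetical CLI wrapper."""
    parser = argparse.ArgumentParser(description="Run LLaMA inference (sketch).")
    parser.add_argument("--ckpt_dir", required=True,
                        help="directory holding the model weights")
    parser.add_argument("--tokenizer_path", required=True,
                        help="path to tokenizer.model")
    parser.add_argument("--prompt", default="I believe the meaning of life is")
    # Defaults below copy the values shown in the quick start.
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--top_p", type=float, default=0.85)
    parser.add_argument("--max_seq_len", type=int, default=1024)
    parser.add_argument("--max_batch_size", type=int, default=1)
    return parser.parse_args(argv)


# In inference.py, the hard-coded call could then become something like:
#     args = parse_args()
#     main(ckpt_dir=args.ckpt_dir, tokenizer_path=args.tokenizer_path,
#          temperature=args.temperature, top_p=args.top_p,
#          max_seq_len=args.max_seq_len, max_batch_size=args.max_batch_size)
```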

LLaMA compatible port

Looking to use the LLaMA model with the HuggingFace library? Take a look at my "transformers-llama" repo.

Other ports

  • Minimal LLaMA - Jason's HuggingFace Transformers port using OPT code internally. This version should be more stable. But the code is not well-tested yet. Bonus: you can quickly see how well the model can be fine-tuned either using HuggingFace PEFT with 8-bit or Pipeline Parallelism.
  • pyllama - Run LLM in a single GPU, as simple as pip install pyllama. It's a quick & dirty hacked version of 🦙 LLaMA. Bonus: comes with a way to start a Gradio Web UI for trying out prompting in the browser. Good tip: "To load KV cache in CPU, run export KV_CAHCHE_IN_GPU=0 in the shell."
  • minichatgpt - Train ChatGPT in minutes with ColossalAI (blog post) (minichatgpt training process is pending my verification. I can confirm the code there was based on ColossalAI's mini demo. It doesn't support LLaMA yet.)

Supporting tools

  • Resharding and HuggingFace conversion - Useful scripts for transforming the weights, if you still want to spread the weights and run the larger model (in fp16 instead of int8) across multiple GPUs for some reason.
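
To see why fp16 weights may need several GPUs while int8 weights can fit on one, a back-of-the-envelope estimate helps. The sketch below counts weight storage only. It assumes nominal parameter counts (13 and 65 billion), 2 bytes per fp16 weight and 1 byte per int8 weight, and it ignores activations, the KV cache, and framework overhead. The helper name `weight_gib_per_gpu` is hypothetical.

```python
def weight_gib_per_gpu(n_params_billion, bytes_per_param, n_gpus=1):
    """Approximate weight memory per GPU in GiB, with weights split evenly."""
    total_bytes = n_params_billion * 1e9 * bytes_per_param
    return total_bytes / n_gpus / 2**30


fp16_13b = weight_gib_per_gpu(13, 2)            # about 24.2 GiB on one GPU
int8_13b = weight_gib_per_gpu(13, 1)            # about 12.1 GiB on one GPU
fp16_65b = weight_gib_per_gpu(65, 2, n_gpus=8)  # about 15.1 GiB per GPU
```

Real usage will be higher than these figures, so treat them as lower bounds when sizing hardware.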

Plan

TODO:

Priority: high

  • Improve sampler - refer to shawwn/llama fork.
  • Fine-tune the models on a diverse set of instruction datasets from LAION's OpenAssistant. Check out my ChatGPT notes for larger training data. (blocked by dataset v1)
  • Try the fine-tuning protocol from Flan.
    • The LLaMA paper touches on fine-tuning briefly, referencing that.
  • Fine-tune the model with HF's PEFT and Accelerate. PEFT doesn't support causal LMs like LLaMA yet. (blocked by PR)

Priority: low

  • Start and try other fine-tuning ideas:
    • ChatGPT-like = LLaMA + CarperAI's tRLX (RLHF) library + Anthropic's public preference dataset. I don't know how feasible this is if the experiments are larger-scale (compute-wise) and use RL models that are good at instruction following.

Reminder-to-self:

  • People under-appreciate fine-tuning alone compared to RLHF. RL algorithms (unsupervised) are quite finicky compared to supervised deep learning. RL is hard-ish.

Original README

This repository is intended as a minimal, hackable and readable example to load LLaMA (arXiv) models and run inference. In order to download the checkpoints and tokenizer, fill out this Google form.

Setup

In a conda env with pytorch / cuda available, run:

pip install -r requirements.txt

Then in this repository:

pip install -e .

Download

Once your request is approved, you will receive links to download the tokenizer and model files. Edit the download.sh script with the signed url provided in the email to download the model weights and tokenizer.

Inference

The provided example.py can be run on a single or multi-gpu node with torchrun and will output completions for two pre-defined prompts. Using TARGET_FOLDER as defined in download.sh:

torchrun --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model

Different models require different MP values:

Model   MP
7B      1
13B     2
33B     4
65B     8
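
The mapping above can be turned into a small helper that builds the torchrun command for a given model size. This is only a sketch: the names `MP_BY_MODEL` and `torchrun_command` and the folder `/data/llama` are hypothetical, and it assumes the checkpoint directories are named after the model size, as the command above suggests.

```python
import shlex

# MP values from the table above.
MP_BY_MODEL = {"7B": 1, "13B": 2, "33B": 4, "65B": 8}


def torchrun_command(model_size, target_folder):
    """Return the torchrun argument list for one model size."""
    mp = MP_BY_MODEL[model_size]  # raises KeyError for unknown sizes
    return [
        "torchrun", "--nproc_per_node", str(mp), "example.py",
        "--ckpt_dir", f"{target_folder}/{model_size}",
        "--tokenizer_path", f"{target_folder}/tokenizer.model",
    ]


command = shlex.join(torchrun_command("13B", "/data/llama"))
# command == "torchrun --nproc_per_node 2 example.py "
#            "--ckpt_dir /data/llama/13B --tokenizer_path /data/llama/tokenizer.model"
```

The argument list can be passed directly to `subprocess.run`. `shlex.join` is used here only to show the equivalent shell line.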

FAQ

Reference

LLaMA: Open and Efficient Foundation Language Models -- https://arxiv.org/abs/2302.13971

@article{touvron2023llama,
  title={LLaMA: Open and Efficient Foundation Language Models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}

Model Card

See MODEL_CARD.md

License

See the LICENSE file.

About

License: GNU General Public License v3.0

Languages: Jupyter Notebook 69.4%, Python 29.8%, Shell 0.8%