Use cu121 for CUDA 12.1, or change it to cu118 for CUDA 11.8. See https://pytorch.org/ to learn more.
If you get errors, try the below first, then go back to step 1:
pip install --upgrade pip
Documentation
We support Hugging Face's TRL, Trainer, Seq2SeqTrainer, and even plain PyTorch code!
```python
from unsloth import FastLlamaModel, FastMistralModel
import torch

max_seq_length = 2048  # Can change to any number <= 4096
dtype = None  # None for auto detection. Float16 for Tesla T4, V100; Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.

# Load the Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = "unsloth/llama-2-7b",  # Supports any Llama model, eg meta-llama/Llama-2-7b-hf
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...",  # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Do model patching and add fast LoRA weights
model = FastLlamaModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,  # Currently only supports dropout = 0
    bias = "none",     # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

trainer = ...  # Use Hugging Face's Trainer and dataset loading (TRL, transformers, etc.)
```
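As one sketch of the TRL path, the patched model can be passed to `trl`'s `SFTTrainer`. The dataset name, text column, and hyperparameters below are illustrative assumptions, not settings from this README:

```python
# Sketch only: wiring an Unsloth-patched model into TRL's SFTTrainer.
# Dataset, dataset_text_field, and hyperparameters are illustrative assumptions.
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

dataset = load_dataset("yahma/alpaca-cleaned", split = "train")

trainer = SFTTrainer(
    model = model,                # patched by FastLlamaModel.get_peft_model above
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",  # column holding the formatted training text
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 10,
        learning_rate = 2e-4,
        optim = "adamw_8bit",
        output_dir = "outputs",
    ),
)
trainer.train()
```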
DPO (Direct Preference Optimization): experimental support
Edit the model's config.json so it is treated as a Llama model. Example gist.
Use Unsloth for DPO for both the base and reference models. Example gist.
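The two-model DPO flow above can be sketched with TRL's `DPOTrainer`. Everything below (dataset columns, beta, step counts) is an illustrative assumption rather than the contents of the linked gists:

```python
# Sketch only: DPO with TRL's DPOTrainer, using Unsloth-loaded models for
# both the policy and the frozen reference. Hyperparameters are assumptions.
from trl import DPOTrainer
from transformers import TrainingArguments

# model, ref_model, tokenizer, dataset loaded beforehand, e.g. via
# FastLlamaModel.from_pretrained(...) for both model copies.
dpo_trainer = DPOTrainer(
    model = model,
    ref_model = ref_model,       # a second Unsloth-loaded copy, kept frozen
    beta = 0.1,                  # DPO temperature
    train_dataset = dataset,     # needs "prompt", "chosen", "rejected" columns
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 10,
        learning_rate = 5e-6,
        output_dir = "outputs",
    ),
)
dpo_trainer.train()
```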
Future Milestones and limitations
- Support Mixtral.
- Non-Llama models are not yet supported; support is planned for the future.
Performance comparisons on 1 Tesla T4 GPU:
Time taken for 1 epoch
One Tesla T4 on Google Colab
bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Huggingface | 1 T4 | 23h 15m | 56h 28m | 8h 38m | 391h 41m |
| Unsloth Open | 1 T4 | 13h 7m (1.8x) | 31h 47m (1.8x) | 4h 27m (1.9x) | 240h 4m (1.6x) |
| Unsloth Pro | 1 T4 | 3h 6m (7.5x) | 5h 17m (10.7x) | 1h 7m (7.7x) | 59h 53m (6.5x) |
| Unsloth Max | 1 T4 | 2h 39m (8.8x) | 4h 31m (12.5x) | 0h 58m (8.9x) | 51h 30m (7.6x) |
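The speedup factors in the table follow directly from the raw epoch times; a small sketch (the parsing helper is mine, numbers are the Alpaca column above):

```python
def to_minutes(t):
    # Parse a duration string like "23h 15m" into total minutes.
    hours, minutes = t.rstrip("m").split("h")
    return int(hours) * 60 + int(minutes)

# Huggingface vs Unsloth Open on Alpaca (52K), from the table above
baseline = to_minutes("23h 15m")   # 1395 minutes
unsloth  = to_minutes("13h 7m")    # 787 minutes
speedup = baseline / unsloth
print(f"{speedup:.1f}x")  # -> 1.8x, matching the table
```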
Peak Memory Usage
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Huggingface | 1 T4 | 7.3GB | 5.9GB | 14.0GB | 13.3GB |
| Unsloth Open | 1 T4 | 6.8GB | 5.7GB | 7.8GB | 7.7GB |
| Unsloth Pro | 1 T4 | 6.4GB | 6.4GB | 6.4GB | 6.4GB |
| Unsloth Max | 1 T4 | 11.4GB | 12.4GB | 11.9GB | 14.4GB |
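To put the peak-memory numbers in perspective, a quick calculation using the Open Assistant column above:

```python
# Peak VRAM on Open Assistant (10K): Huggingface vs Unsloth Open,
# numbers copied from the memory table above.
hf_gb = 14.0
unsloth_gb = 7.8
reduction = (1 - unsloth_gb / hf_gb) * 100
print(f"{reduction:.0f}% less peak memory")  # -> 44% less peak memory
```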
Performance comparisons on 2 Tesla T4 GPUs via DDP:
Time taken for 1 epoch
Two Tesla T4s on Kaggle
bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Huggingface | 2 T4 | 84h 47m | 163h 48m | 30h 51m | 1301h 24m * |
| Unsloth Pro | 2 T4 | 3h 20m (25.4x) | 5h 43m (28.7x) | 1h 12m (25.7x) | 71h 40m (18.1x) * |
| Unsloth Max | 2 T4 | 3h 4m (27.6x) | 5h 14m (31.3x) | 1h 6m (28.1x) | 54h 20m (23.9x) * |
Peak Memory Usage on a Multi GPU System (2 GPUs)
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Huggingface | 2 T4 | 8.4GB \| 6GB | 7.2GB \| 5.3GB | 14.3GB \| 6.6GB | 10.9GB \| 5.9GB * |
| Unsloth Pro | 2 T4 | 7.7GB \| 4.9GB | 7.5GB \| 4.9GB | 8.5GB \| 4.9GB | 6.2GB \| 4.7GB * |
| Unsloth Max | 2 T4 | 10.5GB \| 5GB | 10.6GB \| 5GB | 10.6GB \| 5GB | 10.5GB \| 5GB * |
* SlimOrca uses bsz = 1 for all benchmarks since bsz = 2 OOMs. We can handle bsz = 2, but we benchmark with bsz = 1 for consistency.
Full benchmarking tables
Click "Code" for a fully reproducible example.
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remains identical.
Manual autograd, Triton kernels etc. See our Benchmark Breakdown for more info!
$$
\begin{align}
y &= \frac{x_i}{\sqrt{\frac{1}{n}\sum{x_i^2}+\epsilon}} \cdot w \\
r &= \frac{1}{\sqrt{\frac{1}{n}\sum{x_i^2}+\epsilon}} \\
\frac{dC}{dX} &= \frac{1}{n} r \bigg( n (dY \cdot w) - \bigg( x_i \cdot r \cdot \sum{dY \cdot y_i } \bigg) \bigg)
\end{align}
$$
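As a quick sanity check of the RMSNorm backward formula above, here is a minimal NumPy sketch (the helper names and tolerances are mine) comparing the analytic dC/dX against a central-difference numerical gradient:

```python
import numpy as np

def rmsnorm_forward(x, w, eps = 1e-6):
    # r = 1 / sqrt(mean(x^2) + eps);  y = x * r * w
    r = 1.0 / np.sqrt(np.mean(x * x) + eps)
    return x * r * w, r

def rmsnorm_backward(dY, x, w, r):
    # dC/dX = (1/n) * r * ( n * (dY * w) - x * r * sum(dY * y) ), as above
    n = x.shape[0]
    y = x * r * w
    return (1.0 / n) * r * (n * (dY * w) - x * r * np.sum(dY * y))

rng = np.random.default_rng(0)
x, w, dY = (rng.standard_normal(8) for _ in range(3))
y, r = rmsnorm_forward(x, w)
analytic = rmsnorm_backward(dY, x, w, r)

# Central-difference numerical gradient of C = sum(dY * y) w.r.t. x
h = 1e-5
numeric = np.zeros_like(x)
for i in range(8):
    xp, xm = x.copy(), x.copy()
    xp[i] += h
    xm[i] -= h
    numeric[i] = (np.sum(dY * rmsnorm_forward(xp, w)[0])
                  - np.sum(dY * rmsnorm_forward(xm, w)[0])) / (2 * h)

print(np.allclose(analytic, numeric, atol = 1e-6))  # -> True
```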
Troubleshooting
Sometimes bitsandbytes or xformers do not link properly. Try running:
!ldconfig /usr/lib64-nvidia
Windows is not supported as of yet. We rely on xformers and Triton, so Unsloth will support Windows once both packages officially do.