batch_size > 1 results in NaN loss value
K-Mistele opened this issue · comments
Describe the bug
When I set trainer.batch_size to a value greater than 1 or to auto, my loss value is always NaN, and training fails and exits at the end of the first epoch. Setting batch_size to 1 fixes the issue, but results in very inefficient GPU utilization on more powerful GPUs.
To Reproduce
Steps to reproduce the behavior:
Do LoRA training with a trainer.batch_size of auto or > 1, using the config below (the batch_size shown is 1; change it to auto or any value > 1 to reproduce):
model_type: llm
base_model: mistralai/Mistral-7B-v0.1

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: >-
    You are given a premise and a hypothesis below. If the premise entails the hypothesis, return 0. If the premise co>
    ### Premise: {premise}
    ### Hypothesis: {hypothesis}
    ### Label:

input_features:
  - name: input # this is a placeholder since we are using a prompt template; it is not expected to match a column
    type: text

output_features:
  - name: label
    type: text

trainer:
  type: finetune
  batch_size: 1
  enable_gradient_checkpointing: true
  epochs: 1
  learning_rate: 0.00002
  learning_rate_scheduler:
    decay: cosine
    warmup_fraction: 0.03
    reduce_on_plateau: 0

backend:
  type: local

generation:
  temperature: 0.1
  max_new_tokens: 512

preprocessing:
  split:
    type: random
    probabilities: [0.9, 0.05, 0.05]
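For reference, training is launched through the standard Ludwig CLI (as in the traceback further down); roughly like this, where the dataset path is a placeholder rather than the actual file from this run:

ludwig train --config config.yaml --dataset train.csv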
Expected behavior
I would expect a non-NaN loss value.
Screenshots
Starting with step 0, epoch: 0
Training: 33%|███▎ | 429/1287 [32:07<1:08:57, 4.82s/it, loss=nan]Found NaN or inf values in parameter 'model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight' of module 'LLM'
NaN or inf tensors found in the model. Stopping training.
Could not load best checkpoint state from /mnt/disk/AI/ludwig/ludwig-lora/results/experiment_run/model/training_checkpoints/best.ckpt. Best checkpoint may not exist.
Traceback (most recent call last):
  File "/home/constellate/anaconda3/envs/ludwig/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 197, in main
    CLI()
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 72, in __init__
    getattr(self, args.command)()
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 77, in train
    train.cli(sys.argv[2:])
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/train.py", line 395, in cli
    train_cli(**vars(args))
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/train.py", line 185, in train_cli
    model.train(
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/api.py", line 678, in train
    train_stats = trainer.train(
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/trainers/trainer.py", line 1130, in train
    raise RuntimeError(error_message)
RuntimeError: Training ran into an error. No checkpoint was saved. This is because training was terminated early due to the presence of NaN or Inf values in the model weights before a single valid checkpoint could be saved.
Environment (please complete the following information):
- OS: Debian 12 (Bookworm)
- Python version: Python 3.10.13 (via Anaconda)
- Ludwig version: latest (v0.10.0)
Additional context
GPU: 1x Tesla V100 32GB
Hi @K-Mistele! This is actually a known issue that we recently debugged, and it is not specific to Ludwig! The best way to solve it is to set bnb_4bit_compute_dtype in the quantization section of the Ludwig config to bfloat16 instead of float16, since batch sizes > 1 with Mistral in particular lead to bit overflows during training, resulting in a NaN loss during the first backward pass of the training loop.
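Concretely, that is one extra key in the quantization section of the config you posted (a sketch, assuming everything else stays the same):

quantization:
  bits: 4
  bnb_4bit_compute_dtype: bfloat16  # instead of the float16 compute dtype that overflows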
However, I notice you're training on a V100, and I don't think bfloat16 is supported there since it only works on Ampere architectures and above. Is there any chance you can use a newer NVIDIA GPU?
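A quick way to confirm what a given card supports (just a plain PyTorch check, nothing Ludwig-specific):

import torch

# True on Ampere and newer GPUs (A100, A5000/A6000, A10G, ...);
# a V100 (Volta) is expected to print False.
print(torch.cuda.is_bf16_supported())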
The only NVIDIA GPU that supports bfloat16 is the A100, which I do not have access to. My V100 is an owned GPU, not a rented/cloud one, so I try to stick with it whenever possible since I'm not paying by the hour.
@K-Mistele that makes sense! Actually, the entire A series uses Ampere, so you could consider an A5000 from AWS, which is pretty cheap. I might also suggest giving the Predibase free trial a try, since we have A5000s/A6000s etc. (A10Gs) for fine-tuning, and we have $25 in free trial credits!
I am planning to; I just want to make sure I can use the tool locally first. Is there no workaround for a V100?
Unfortunately, not to my knowledge with Mistral. Do you want to test Llama-2-7B instead? The issue doesn't show up there with larger batch sizes!
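If you try that, the only change needed in the config above should be the base_model line (note that meta-llama/Llama-2-7b-hf is gated on Hugging Face, so it needs an access token):

base_model: meta-llama/Llama-2-7b-hf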
Yeah, I can try it.
@K-Mistele let me know how it goes!
Do you know if Zephyr has the same problem, @arnavgarg1?
@K-Mistele not to my knowledge!
@K-Mistele Did the fix work?