artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs

Home Page: https://arxiv.org/abs/2305.14314

Question: CUDA memory usage in the evaluation phase

LimboWK opened this issue · comments

I have customized SFT and evaluation scripts using QLoRA, but I run out of GPU memory during the evaluation steps. Does anyone have the same issue, or any insight into how to reduce memory usage during eval?

The trainer and dataset setup look like this:

#######################################################################
gradient_accumulation_steps = 4
per_device_train_batch_size = 4
per_device_eval_batch_size = 1
total_train_samples = len(train_data)
total_validation_samples = len(validation_data)
print("*** Total training samples:", total_train_samples)
print("*** Total validation samples:", total_validation_samples)

num_train_steps_per_epoch = (total_train_samples // per_device_train_batch_size // gradient_accumulation_steps)
print('*** num_train_steps_per_epoch: ', num_train_steps_per_epoch)
num_train_epochs = 1
max_steps = int(num_train_epochs * num_train_steps_per_epoch)
print('*** Max steps:', max_steps)

# trainer

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=validation_data,
    compute_metrics=compute_bleu_score,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=2,
        max_steps=max_steps,
        learning_rate=1e-4,
        evaluation_strategy="steps",
        eval_steps=50,
        save_steps=50,
        logging_steps=10,
        save_total_limit=2,
        fp16=True,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
)
model.config.use_cache = False
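
One likely culprit when compute_metrics is set: the Trainer accumulates the raw logits of every eval batch (shape batch x seq_len x vocab_size) on the GPU before the metric is computed, so memory grows with the size of the validation set even at per_device_eval_batch_size=1. A minimal sketch of a workaround, assuming compute_bleu_score can be adapted to take predicted token IDs instead of raw logits, is to pass a preprocess_logits_for_metrics hook that shrinks the logits before they are accumulated:

# Sketch only: reduce each eval batch's logits to predicted token IDs so the
# Trainer does not accumulate (batch, seq_len, vocab_size) float tensors.
def preprocess_logits_for_metrics(logits, labels):
    if isinstance(logits, tuple):
        # some causal LM heads return (logits, past_key_values, ...)
        logits = logits[0]
    return logits.argmax(dim=-1)

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=validation_data,
    compute_metrics=compute_bleu_score,  # would now receive token IDs, not logits
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=2,
        max_steps=max_steps,
        learning_rate=1e-4,
        evaluation_strategy="steps",
        eval_steps=50,
        save_steps=50,
        logging_steps=10,
        save_total_limit=2,
        fp16=True,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
)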

per_device_train_batch_size = 1
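
Reducing the train batch size mainly helps the training steps; in the snippet above per_device_eval_batch_size is already 1. For the evaluation loop itself, TrainingArguments also has eval_accumulation_steps, which moves the accumulated prediction tensors from the GPU to the CPU every N eval steps instead of holding them all until evaluation finishes. A minimal sketch (the value 4 is arbitrary):

args = transformers.TrainingArguments(
    output_dir="outputs",
    per_device_eval_batch_size=1,
    # move accumulated eval outputs to the CPU every 4 prediction steps
    # instead of keeping them all in GPU memory until eval ends
    eval_accumulation_steps=4,
)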

I also encountered this problem. Did you solve it later?