Why does GPU memory keep climbing during training? It hits OOM after a short while

Question

Why does GPU memory keep climbing during training? It hits OOM after a short while.

Amazing-J opened this issue a year ago · comments

CUDA_VISIBLE_DEVICES=4 python baichuan_lora_tuning.py \
    --tokenized_dataset hc3_chatgpt_zh_specific_qa_baichuan-7B \
    --lora_rank 4 \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 2 \
    --num_train_epochs 2 \
    --save_steps 200 \
    --save_total_limit 2 \
    --learning_rate 1e-4 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 20 \
    --output_dir weights/hc3_chatgpt_zh_specific_qa_baichuan-7B

beyondguo · Answer 1 · Mon Jul 31 2023 20:29:32 GMT+0800 (China Standard Time)

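That batch size is rather extravagant. GPU memory usage is also dynamic: if the texts early in training are all fairly short, memory may be enough, but once a very long text turns up later, it may no longer fit.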
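
To make the point concrete, here is a rough sketch of how the padded size of a batch jumps when a single long sample lands in it. It assumes each batch is padded to its own longest sequence, which is a common collator behavior but has not been checked against baichuan_lora_tuning.py. The function name and the token lengths are made up for illustration.

```python
def padded_tokens_per_batch(lengths, batch_size):
    """Padded token count of each batch: rows * longest sequence in that batch."""
    counts = []
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every row is padded up to the longest sequence in the batch.
        counts.append(len(batch) * max(batch))
    return counts


# Hypothetical token lengths: mostly short, with one very long sample in the last batch.
lengths = [120] * 128 + [2048] + [120] * 63
counts = padded_tokens_per_batch(lengths, batch_size=64)
print(counts)
```

With these made-up numbers the last batch holds roughly 17 times as many padded tokens as the earlier ones, and activation memory grows at least in proportion. Lowering --per_device_train_batch_size while raising --gradient_accumulation_steps keeps the effective batch size the same with a smaller peak, and truncating over-long samples when tokenizing is another option.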