Problem when resuming the model from a checkpoint
zht8506 opened this issue
Hi, thank you for the excellent work. I want to run llava-v1-80k with LoRA. The model trains normally.
However, when I resume from a checkpoint in work_dir (work_dir/checkpoint-xx), I run into errors.
My SSH connection to the cluster gets killed.
I also found that memory usage exceeds the upper limit.
So, how can I solve this problem, and how do I correctly resume the model?
Thank you very much.
Hi, may I ask if there is any update on the checkpoint resuming issue?
I found that it is because CPU memory exceeds the limit. I solved this problem by using only 6 GPUs (CUDA_VISIBLE_DEVICES=0,1,2,3,4,5) on an 8-GPU machine.
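
For anyone hitting the same thing, here is a minimal Python sketch of the workaround above. It assumes one training process is launched per visible GPU and that each process holds its own copy of the checkpoint in CPU memory while resuming, which would explain why fewer GPUs lowers host RAM usage; I have not confirmed that this is what the training script does. The helper names and the 10 GB per-process figure are hypothetical, for illustration only.

```python
import os


def limit_visible_gpus(gpu_ids):
    """Expose only the given GPU indices to this process and its children.

    This must run before any CUDA-using library initializes CUDA,
    otherwise the setting has no effect on the current process.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


def estimated_peak_host_gb(num_processes, per_process_gb):
    """Rough upper bound on CPU RAM while resuming.

    Assumption (not verified against the repo): every process loads
    its own copy of the checkpoint into CPU memory at the same time.
    """
    return num_processes * per_process_gb


if __name__ == "__main__":
    # Use 6 of the 8 GPUs, as in the comment above.
    print(limit_visible_gpus(range(6)))
    # Hypothetical 10 GB of host memory per process during resume.
    print(estimated_peak_host_gb(8, 10), "GB with 8 processes")
    print(estimated_peak_host_gb(6, 10), "GB with 6 processes")
```

Setting the variable on the command line before the launch command, as in the comment above, has the same effect. If the estimate is still above the machine's RAM limit, reducing the number of processes further is the same lever.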