Problem when resuming a model from a checkpoint

Question

zht8506 opened this issue a month ago · comments

Hi, thank you for your excellent work. I want to train llava-v1-80k with LoRA, and training itself runs normally.
However, when I resume from a checkpoint in work_dir (work_dir/checkpoint-xx), I run into an error:
my SSH connection to the cluster gets killed.

I also found that memory usage exceeds the upper limit.

How can I solve this problem and resume the model correctly?
Thank you very much.

Xing Yun (邢云) · Answer 1 · Fri May 10 2024 16:30:20 GMT+0800 (China Standard Time)

Hi, may I ask if there is any update on the checkpoint resuming issue?

zht8506 · Answer 2 · Sat May 11 2024 11:54:44 GMT+0800 (China Standard Time)

> Hi, may I ask if there is any update on the checkpoint resuming issue?

I found that the cause was CPU (host) memory exceeding the limit. I solved the problem by using only 6 GPUs (CUDA_VISIBLE_DEVICES=0,1,2,3,4,5) on an 8-GPU machine.
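
For anyone hitting the same thing, here is a minimal sketch of that workaround in Python. The thread does not say why fewer GPUs helps. The sketch assumes that one training process runs per visible GPU and that each process holds its own copy of the checkpoint in host RAM while resuming, so peak CPU memory grows with the number of processes. The helper names and the memory figures (256 GB host RAM, 40 GB peak per process) are hypothetical and only illustrate the arithmetic; measure your own values.

```python
import os


def max_ranks_for_host_memory(host_mem_gb, per_rank_peak_gb, total_gpus):
    """Largest number of processes whose combined peak CPU memory fits in host RAM.

    Assumption (not confirmed in the thread): each process needs roughly
    per_rank_peak_gb of host memory while loading the checkpoint.
    """
    if per_rank_peak_gb <= 0:
        raise ValueError("per_rank_peak_gb must be positive")
    fit = int(host_mem_gb // per_rank_peak_gb)
    # Never exceed the GPUs that exist, and always keep at least one.
    return max(1, min(total_gpus, fit))


def limit_visible_gpus(num_gpus, base_env=None):
    """Return a copy of the environment that exposes only GPUs 0..num_gpus-1."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(num_gpus))
    return env


if __name__ == "__main__":
    # Hypothetical numbers for an 8-GPU machine.
    ranks = max_ranks_for_host_memory(host_mem_gb=256, per_rank_peak_gb=40, total_gpus=8)
    env = limit_visible_gpus(ranks)
    print(env["CUDA_VISIBLE_DEVICES"])  # prints 0,1,2,3,4,5
    # Pass `env` to your own launch command, e.g. subprocess.run(cmd, env=env).
```

The variable has to be set before the training process starts (or before the CUDA runtime is initialized), which is why the sketch builds an environment to hand to the launcher instead of changing it mid-run. Keep the number of processes in your launch command consistent with the number of visible GPUs.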