haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Home Page:https://llava.hliu.cc

Problem when resuming training from a checkpoint

zht8506 opened this issue

Hi, thank you for the excellent work. I want to train llava-v1-80k with LoRA. Training from scratch runs normally.
However, when I resume from a checkpoint in work_dir (work_dir/checkpoint-xx), I run into an error.
My SSH session to the cluster gets killed.
[image]
I also see that memory usage exceeds the upper limit.
[image]
How can I solve this problem, and what is the correct way to resume training?
Thank you very much.

Hi, may I ask if there is any update on the checkpoint resuming issue?

I found that this happens because CPU memory usage exceeds the limit. I solved the problem by using only 6 GPUs (CUDA_VISIBLE_DEVICES=0,1,2,3,4,5) on an 8-GPU machine.
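
For anyone who prefers to set this from Python rather than on the command line, here is a minimal sketch of the same workaround. The helper name `limit_visible_gpus` is hypothetical and not part of LLaVA. My guess as to why it helps is that each training process loads checkpoint state into host RAM on resume, so fewer processes means a lower peak, but I have not verified this. The variable has to be set before CUDA is initialized in the process (in practice, before importing torch or launching the trainer).

```python
import os


def limit_visible_gpus(num_gpus: int) -> str:
    """Expose only the first `num_gpus` devices to CUDA in this process.

    Hypothetical helper for illustration; it only sets an environment
    variable and does not touch any LLaVA or PyTorch API.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Build a comma-separated list of device indices, e.g. "0,1,2".
    device_ids = ",".join(str(i) for i in range(num_gpus))
    # Must be set before CUDA is initialized for it to take effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return device_ids


if __name__ == "__main__":
    # Use 6 of the 8 GPUs, as in the workaround above.
    print(limit_visible_gpus(6))
```

If you launch training through a shell script, prefixing the launch command with `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5` has the same effect.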