rabbityl / lepard

[CVPR 2022, Oral] Learning Partial point cloud matching in Rigid and Deformable scenes

OOM error; memory cost keeps growing!

35p32 opened this issue · comments

commented

Thanks for your excellent work! This paper is really nice and amazing~
When I try to train lepard myself (I only changed "batchsize=1" and "num_workers=0"), I find that memory keeps rising (I have 126 GB of RAM; lepard uses almost 100 GB and is still growing), until the process is finally killed by Linux.

Have you encountered this kind of problem before?
I did not change your code, so I guess this problem comes from the dataloader? Can you give me some suggestions?

Hi, I just tested this code again with bsize=8 on an 80GB GPU card and did not get an OOM error.
Since you have a much larger GPU memory, it should be fine.
If it's the dataloader, you can try to modify the dataset to over-fit only one sample.
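A minimal sketch of that over-fit test, assuming a map-style PyTorch dataset; the SingleSampleDataset wrapper name and its arguments are hypothetical, not part of lepard:

```python
from torch.utils.data import Dataset

class SingleSampleDataset(Dataset):
    """Always return the same cached sample from a base dataset.

    If memory still grows while looping over this wrapper, the leak is not
    caused by loading new data from disk.
    """
    def __init__(self, base_dataset, index=0, length=1000):
        self.sample = base_dataset[index]  # load once, reuse every step
        self.length = length               # pretend epoch length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return self.sample
```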

commented

Thanks for your reply. I have enough GPU memory; the problem is an OOM on host memory ("memory", not GPU memory).
When I just loop over the dataloader and do nothing else, I find that the host memory keeps growing.
Maybe a memory leak?
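One way to run that isolation test is to iterate the dataloader and watch resident memory; a minimal sketch, assuming psutil is installed, with train_loader standing in for whatever DataLoader the training config builds:

```python
import os
import psutil  # pip install psutil

def loop_dataloader(loader, num_steps=2000):
    """Iterate the dataloader without training and report resident memory.

    Steadily climbing RSS points at data loading/collation; flat RSS points
    at the training step instead.
    """
    proc = psutil.Process(os.getpid())
    for step, batch in enumerate(loader):
        if step % 100 == 0:
            rss_gb = proc.memory_info().rss / 1024 ** 3
            print(f"step {step}: RSS = {rss_gb:.2f} GB")
        if step >= num_steps:
            break

# usage: loop_dataloader(train_loader)
```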

commented

For example, in your "forward" source code, I used a memory monitor to debug the memory leak and found the problem below.
(The right part is your source code, the left part is the host memory cost (not GPU cost) for every line. Please note the "Increment" column, which shows the memory increase for every forward pass.)
[screenshot: line-by-line memory profiler output showing the "Increment" column]

When I call forward once, [Line 26] increases memory by 3.9 MB; when I call forward twice, [Line 26] increases it by 3.9 + 3.9 = 7.8 MB. Over one epoch of forward passes (about 20000+ iterations), [Line 26] increases memory by about 80 GB.
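The memory_profiler package is one tool that produces this kind of per-line report with an "Increment" column; a minimal sketch, where forward_once and its arguments are illustrative rather than lepard's actual code:

```python
from memory_profiler import profile  # pip install memory_profiler

@profile
def forward_once(model, batch):
    """Run one forward pass under the line-by-line memory profiler.

    When the script is executed normally, memory_profiler prints, for each
    line of this function, the total memory usage and the per-call
    "Increment" column referenced above.
    """
    return model(batch)
```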

commented

Does this mean all the functions cause a memory leak?
The training takes 15+ epochs on my machine and works totally fine.
No idea what's going on.

commented

Does this mean all the functions cause a memory leak? The training takes 15+ epochs on my machine and works totally fine. No idea what's going on.

Sir, could you please let me know if you can reproduce the results on 3DMatch & 3DLoMatch?

commented

Does this mean all the functions cause a memory leak? The training takes 15+ epochs on my machine and works totally fine. No idea what's going on.

Thank you so much

commented

Does this mean all the functions cause a memory leak? The training takes 15+ epochs on my machine and works totally fine. No idea what's going on.

Sir, could you please let me know if you can reproduce the results on 3DMatch & 3DLoMatch?

Yes, I can reproduce the results.

commented

Hi, I met the same problem. Did you solve it?

I found that the problem is caused by AverageMeter.update() in the training step. I solved it by detaching the input tensor during accumulation in AverageMeter.update().
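A minimal sketch of that fix, assuming an AverageMeter similar to the common PyTorch training utility; the exact class in lepard may differ:

```python
import torch

class AverageMeter:
    """Tracks a running average of a scalar metric across training steps."""
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        # If `val` is a tensor still attached to the autograd graph,
        # accumulating it keeps every iteration's graph alive in host
        # memory, so RAM grows each step. Detaching and converting to a
        # plain Python float releases the graph and stops the leak.
        if torch.is_tensor(val):
            val = val.detach().item()
        self.sum += float(val) * n
        self.count += n

    @property
    def avg(self):
        return self.sum / max(self.count, 1)
```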