OOM error; memory cost keeps growing!
35p32 opened this issue
Thanks for your excellent work! This paper is really nice and amazing~
When I try to train Lepard myself (I only changed "batchsize=1" and "num_workers=0"), I found that the memory keeps growing (I have 126 GB of RAM; Lepard uses almost 100 GB and keeps rising), until the process is finally killed by Linux.
Have you encountered this kind of problem before?
I did not change your code, so I guess this problem comes from the dataloader? Can you give me some suggestions?
Hi, I just tested this code again with bsize=8 on an 80GB GPU card. I did not get an OOM error.
Since you have much larger GPU memory, it should be fine.
If it's the dataloader, you can try modifying the dataset to over-fit on only one sample.
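(For anyone trying this: a minimal sketch of the over-fitting trick, assuming a standard PyTorch Dataset; the `OneSampleDataset` wrapper name is hypothetical, not part of Lepard.)

```python
import torch
from torch.utils.data import Dataset

class OneSampleDataset(Dataset):
    """Hypothetical wrapper: always serves the same sample, so the
    dataloading pipeline can be ruled in or out as the leak source."""
    def __init__(self, base_dataset, index=0, length=1000):
        self.sample = base_dataset[index]  # load once, reuse every step
        self.length = length               # pretend the epoch has `length` items

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return self.sample
```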
Thanks for your reply. I have enough GPU memory; the problem is the host "memory" OOM.
When I just loop over the dataloader and do nothing else, I find that the memory keeps growing.
Maybe a memory leak?
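(A minimal sketch of that isolation test, assuming a standard PyTorch DataLoader; watching the host resident set size with psutil is one common way to see the growth. `dataset` stands in for the Lepard dataset configured above.)

```python
import psutil
from torch.utils.data import DataLoader

def rss_mb():
    """Resident set size of this process in MB (host RAM, not GPU)."""
    return psutil.Process().memory_info().rss / 1024 ** 2

def loop_loader_only(dataset):
    # Same settings as above: batchsize=1, num_workers=0
    loader = DataLoader(dataset, batch_size=1, num_workers=0)
    for step, _batch in enumerate(loader):
        # Do nothing with the batch; RSS should stay roughly flat
        # if the dataloader itself is leak-free.
        if step % 100 == 0:
            print(f"step {step}: RSS = {rss_mb():.1f} MB")
```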
For example, I used a memory monitor to debug the memory leak in your "forward" source code, and found the problems shown below.
(The right part is your source code; the left part is the memory cost (not GPU cost) for every line. Please note the "Increment" column, which shows the memory increase for each forward pass.)
When I run forward once, [Line 26] increases memory by 3.9 MB; when I run forward twice, [Line 26] increases by 3.9 + 3.9 = 7.8 MB. After one epoch (about 20,000+ forward passes), [Line 26] has accumulated about 80 GB.
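(The "Increment" column described here matches the output of the memory_profiler package; a minimal sketch of how such line-by-line monitoring is typically set up — the function below is a placeholder, not Lepard's actual forward.)

```python
from memory_profiler import profile  # pip install memory-profiler

@profile  # prints a per-line table with "Mem usage" and "Increment" columns
def forward_once(model, batch):
    # Placeholder for one forward pass; decorate the real forward() the same way.
    return model(batch)
```

Each call to the decorated function prints one table; a line whose baseline "Mem usage" keeps climbing across calls is the one accumulating host memory.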
Does this mean all these functions cause a memory leak?
The training takes 15+ epochs on my machine and works totally fine.
No idea what's going on.
Sir, could you please let me know if you can reproduce the results on 3DMatch & 3DLoMatch?
Thank you so much
Yes, I can reproduce the results.
Hi, I met the same problem, did you solve it?
I found that the problem is caused by AverageMeter.update() in the training step. I solved it by detaching the input tensor during accumulation in AverageMeter.update().
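(A minimal sketch of that fix, assuming a typical PyTorch-style AverageMeter; Lepard's actual class may differ in the details.)

```python
import torch

class AverageMeter:
    """Sketch of a running-average meter with the detach fix applied."""
    def __init__(self):
        self.sum, self.count = 0.0, 0

    def update(self, val, n=1):
        # If `val` is a loss tensor still attached to the autograd graph,
        # storing it keeps every step's graph alive, so memory grows each
        # iteration. Detaching (or calling .item()) drops the graph reference.
        if torch.is_tensor(val):
            val = val.detach().item()
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / max(self.count, 1)
```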