OOM error; memory cost keeps growing!
35p32 opened this issue
Thanks for your excellent work! This paper is really nice and amazing~
When I try to train Lepard myself (I only changed "batchsize=1" and "num_workers=0"), I found that the memory keeps growing (I have 126 GB of RAM; Lepard uses almost 100 GB and keeps rising), until the process is finally killed by Linux.
Have you encountered this kind of problem before?
I did not change your code, so I guess this problem comes from the dataloader? Can you give me some suggestions?
Hi, I just tested this code again with bsize=8 on an 80GB GPU card. I did not get an OOM error.
Since you have much larger GPU memory, it should be fine.
If it's the dataloader, you can try modifying the dataset to over-fit on only one sample.
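(For anyone trying this: a minimal sketch of the over-fitting trick, assuming a standard PyTorch Dataset; the `OneSampleDataset` wrapper name is hypothetical, not part of Lepard.)

```python
import torch
from torch.utils.data import Dataset

class OneSampleDataset(Dataset):
    """Hypothetical wrapper: always serves the same sample, so the
    dataloading pipeline can be ruled in or out as the leak source."""
    def __init__(self, base_dataset, index=0, length=1000):
        self.sample = base_dataset[index]  # load once, reuse every step
        self.length = length               # pretend the epoch has `length` items

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return self.sample
```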
Thanks for your reply. I have enough GPU memory; the problem is the host "memory" OOM.
When I just loop over the dataloader and do nothing else, I find that the memory keeps growing.
Maybe a memory leak?
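(A minimal sketch of that isolation test, assuming a standard PyTorch DataLoader; watching the host resident set size with psutil is one common way to see the growth. `dataset` stands in for the Lepard dataset configured above.)

```python
import psutil
from torch.utils.data import DataLoader

def rss_mb():
    """Resident set size of this process in MB (host RAM, not GPU)."""
    return psutil.Process().memory_info().rss / 1024 ** 2

def loop_loader_only(dataset):
    # Same settings as above: batchsize=1, num_workers=0
    loader = DataLoader(dataset, batch_size=1, num_workers=0)
    for step, _batch in enumerate(loader):
        # Do nothing with the batch; RSS should stay roughly flat
        # if the dataloader itself is leak-free.
        if step % 100 == 0:
            print(f"step {step}: RSS = {rss_mb():.1f} MB")
```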
For example, I used a memory monitor to debug the memory leak in your "forward" source code, and found the problems shown below.
(The right part is your source code; the left part is the memory cost (not GPU cost) for every line. Please note the "Increment" column, which shows the memory increase for each forward pass.)
When I run forward once, [Line 26] increases memory by 3.9 MB; when I run forward twice, [Line 26] increases by 3.9 + 3.9 = 7.8 MB. After one epoch (about 20,000+ forward passes), [Line 26] has accumulated about 80 GB.
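(The "Increment" column described here matches the output of the memory_profiler package; a minimal sketch of how such line-by-line monitoring is typically set up — the function below is a placeholder, not Lepard's actual forward.)

```python
from memory_profiler import profile  # pip install memory-profiler

@profile  # prints a per-line table with "Mem usage" and "Increment" columns
def forward_once(model, batch):
    # Placeholder for one forward pass; decorate the real forward() the same way.
    return model(batch)
```

Each call to the decorated function prints one table; a line whose baseline "Mem usage" keeps climbing across calls is the one accumulating host memory.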
Does this mean all these functions cause a memory leak?
The training takes 15+ epochs on my machine and works totally fine.
No idea what's going on.
Sir, could you please let me know if you can reproduce the results on 3DMatch & 3DLoMatch?
Thank you so much
Yes, I can reproduce the results.
Hi, I met the same problem, did you solve it?
I found that the problem is caused by AverageMeter.update() in the training step. I solved it by detaching the input tensor during accumulation in AverageMeter.update().
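(A minimal sketch of that fix, assuming a typical PyTorch-style AverageMeter; Lepard's actual class may differ in the details.)

```python
import torch

class AverageMeter:
    """Sketch of a running-average meter with the detach fix applied."""
    def __init__(self):
        self.sum, self.count = 0.0, 0

    def update(self, val, n=1):
        # If `val` is a loss tensor still attached to the autograd graph,
        # storing it keeps every step's graph alive, so memory grows each
        # iteration. Detaching (or calling .item()) drops the graph reference.
        if torch.is_tensor(val):
            val = val.detach().item()
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / max(self.count, 1)
```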