linjieli222 / HERO

Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"

Home Page: https://arxiv.org/abs/2005.00200

OOM in pretraining

hgzjy25 opened this issue · comments

I tried to pretrain the HERO model from scratch on the HowTo100M and TV datasets. The code worked well at the beginning, but crashed after thousands of iterations. I found that memory usage kept growing during training and eventually ran out of memory. Have you met this problem?
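A minimal sketch (not part of the HERO code) for narrowing down where the growth happens, assuming a PyTorch training loop: log allocated GPU memory and host RSS every few steps to see which one is actually climbing.

```python
import os
import psutil
import torch

def log_memory(step, every=100):
    """Print allocated GPU memory and host RSS every `every` steps."""
    if step % every != 0:
        return
    gpu_mb = torch.cuda.memory_allocated() / 1024 ** 2
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    print(f"step {step}: GPU allocated {gpu_mb:.0f} MiB, host RSS {rss_mb:.0f} MiB")
```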

I also encountered the same problem. @linjieli222

@Liu0329 @hgzjy25

I have received similar reports about this issue. However, we did not encounter it during our experiments. You may need to search online for potential solutions; sorry for any inconvenience. If you do find a solution, please come back and post it here to help other people in need.

One potential direction: check whether the memory growth is due to caching. If so, you can force the cache to be cleared periodically.
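A rough sketch of what that could look like, assuming a standard PyTorch loop (this is not HERO's actual pretraining code, and `clear_every` is a hypothetical knob to tune):

```python
import gc
import torch

def train(model, train_loader, optimizer, clear_every=1000):
    for step, batch in enumerate(train_loader):
        loss = model(**batch)        # placeholder forward pass returning a scalar loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Periodically drop dangling Python references and return cached
        # GPU blocks to the driver, in case caching is what keeps growing.
        if (step + 1) % clear_every == 0:
            gc.collect()
            torch.cuda.empty_cache()
```

Note that `torch.cuda.empty_cache()` only releases memory held by PyTorch's caching allocator; if the growth comes from objects still referenced in Python (e.g. accumulating losses without `.item()`), clearing the cache alone will not stop it.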