ClementPinard / FlowNetPytorch

Pytorch implementation of FlowNet by Dosovitskiy et al.


OSError: [Errno 12] Cannot allocate memory dataloader fills up entire RAM

jeffbaena opened this issue · comments

Dear Clement,

thanks so much for your great code; I have been using it for a long time now and it is great!
Lately I had to move to a smaller cluster for my trainings. Unfortunately, this machine only has 8 CPU cores, 128 GB of RAM, and 4 RTX 2080 GPUs (I am using only one of them).

Your code runs fine on other clusters, but on the new machine I get `self.pid = os.fork() OSError: [Errno 12] Cannot allocate memory` after a few epochs. This happens with any number of workers: I tried 2, 4, and 8. With workers = 0 it is very slow.

It seems related to os.fork() in the dataloader, and I notice that the RAM is full when this happens.
I have found some other threads on this but no solution. I tried adding swap space, but that only delays the problem.

I cannot wrap my head around this. Do you have any idea?

thanks,
Stefano

Traceback (most recent call last):
  File "/home/ssavian/FlowNet_pycharm/main/chess_FWD_ironspeed.py", line 543, in <module>
    train_pth,note = main()
  File "/home/ssavian/FlowNet_pycharm/main/chess_FWD_ironspeed.py", line 298, in main
    EPE = validate(val_loader, model, epoch, output_writers)
  File "/home/ssavian/FlowNet_pycharm/main/chess_FWD_ironspeed.py", line 510, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/ssavian/anaconda3/envs/FNC_env_p35_new/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/home/ssavian/anaconda3/envs/FNC_env_p35_new/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
    w.start()
  File "/home/ssavian/anaconda3/envs/FNC_env_p35_new/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ssavian/anaconda3/envs/FNC_env_p35_new/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ssavian/anaconda3/envs/FNC_env_p35_new/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/home/ssavian/anaconda3/envs/FNC_env_p35_new/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/ssavian/anaconda3/envs/FNC_env_p35_new/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Hi, thanks for your interest in this repo.
Do you think it is a memory leak, with RAM slowly building up to the machine's capacity, or does it fill up suddenly at the beginning? What is your batch size?

Does it still fail with a smaller batch size or with smaller images?

It seems to me that the problem is more related to PyTorch itself.

It seems like a memory leak. The weird thing is that on other, more powerful machines the RAM stays below 10 GB, while on this one it constantly builds up until it's full.
Batch size is 8; I haven't tried a smaller one yet.

It may be related to PyTorch, but I have made a conda environment for your network and the port has always worked elsewhere.

Thanks for the prompt reply!

Thanks for your reply. I have found the bug (hopefully): I was saving some metrics without calling .item(). The weird thing is this never caused problems on other servers, but here it did.

Thanks!