possible deadlock in dataloader

Question

zym1010 opened this issue 7 years ago · comments

The bug is described at pytorch/examples#148. I wonder whether this is a bug in PyTorch itself, as the example code looks clean to me. I also wonder whether it is related to #1120.

Adam Paszke · Answer 1 · Wed Apr 26 2017 04:53:57 GMT+0800 (China Standard Time)

How much free memory do you have when the loader stops?

Yimeng Zhang · Answer 2 · Wed Apr 26 2017 05:02:35 GMT+0800 (China Standard Time)

@apaszke if I check top, the remaining memory (cached mem also counts as used) is usually 2GB. But if you don't count cached as used, it's always a lot, say 30GB+.

Yimeng Zhang · Answer 3 · Wed Apr 26 2017 05:03:42 GMT+0800 (China Standard Time)

Also I don't understand why it always stops at beginning of validation, but not everywhere else.

Natalia Gimelshein · Answer 4 · Wed Apr 26 2017 05:40:27 GMT+0800 (China Standard Time)

Possibly because for validation a separate loader is used that pushes the use of shared memory over the limit.

Yimeng Zhang · Answer 5 · Wed Apr 26 2017 05:52:55 GMT+0800 (China Standard Time)

@ngimel

I just ran the program again. And got stuck.

Output of top:

top - 17:51:18 up 2 days, 21:05,  2 users,  load average: 0.49, 3.00, 5.41
Tasks: 357 total,   2 running, 355 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.1 sy,  0.7 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  65863816 total, 60115084 used,  5748732 free,  1372688 buffers
KiB Swap:  5917692 total,      620 used,  5917072 free. 51154784 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3067 aalreja   20   0  143332 101816  21300 R  46.1  0.2   1631:44 Xvnc
16613 aalreja   30  10   32836   4880   3912 S  16.9  0.0   1:06.92 fiberlamp
 3221 aalreja   20   0 8882348 1.017g 110120 S   1.3  1.6 579:06.87 MATLAB
 1285 root      20   0 1404848  48252  25580 S   0.3  0.1   6:00.12 dockerd
16597 yimengz+  20   0   25084   3252   2572 R   0.3  0.0   0:04.56 top
    1 root      20   0   33616   4008   2624 S   0.0  0.0   0:01.43 init

Output of free

yimengzh_everyday@yimengzh:~$ free
             total       used       free     shared    buffers     cached
Mem:      65863816   60122060    5741756    9954628    1372688   51154916
-/+ buffers/cache:    7594456   58269360
Swap:      5917692        620    5917072

Output of nvidia-smi

yimengzh_everyday@yimengzh:~$ nvidia-smi
Tue Apr 25 17:52:38 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 30%   42C    P8    14W / 250W |   3986MiB /  6082MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:81:00.0     Off |                  Off |
|  0%   46C    P0    57W / 235W |      0MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16509    C   python                                        3970MiB |
+-----------------------------------------------------------------------------+

I don't think it's a memory issue.

Adam Paszke · Answer 6 · Wed Apr 26 2017 06:30:26 GMT+0800 (China Standard Time)

There are separate limits for shared memory. Can you try ipcs -lm or cat /proc/sys/kernel/shmall and cat /proc/sys/kernel/shmmax? Also, does it deadlock if you use fewer workers (e.g. test with the extreme case of 1 worker)?

Yimeng Zhang · Answer 7 · Wed Apr 26 2017 06:32:54 GMT+0800 (China Standard Time)

@apaszke

yimengzh_everyday@yimengzh:~$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1

yimengzh_everyday@yimengzh:~$ cat /proc/sys/kernel/shmall
18446744073692774399
yimengzh_everyday@yimengzh:~$ cat /proc/sys/kernel/shmmax
18446744073692774399

how do they look for you?

as for fewer workers, I believe it won't happen that often. (I can try now). But I think in practice I need that many workers.

Adam Paszke · Answer 8 · Wed Apr 26 2017 06:43:59 GMT+0800 (China Standard Time)

You have a max of 4096 shared memory segments allowed, maybe that's an issue. You can try increasing that by writing to /proc/sys/kernel/shmmni (maybe try 8192). You may need superuser privileges.

Yimeng Zhang · Answer 9 · Wed Apr 26 2017 06:59:28 GMT+0800 (China Standard Time)

@apaszke well these are default values by both Ubuntu and CentOS 6... Is that really an issue?

Yimeng Zhang · Answer 10 · Wed Apr 26 2017 07:57:56 GMT+0800 (China Standard Time)

@apaszke when running training program, ipcs -a actually shows no shared memory being used. Is that expected?

Yimeng Zhang · Answer 11 · Wed Apr 26 2017 09:06:56 GMT+0800 (China Standard Time)

@apaszke tried running the program (still 22 workers) with following setting on shared mem, and stuck again.

yimengzh_everyday@yimengzh:~$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1

didn't try one worker. first, that would be slow; second, if the problem is really dead locking, then it would definitely disappear.

Adam Paszke · Answer 12 · Wed Apr 26 2017 16:20:02 GMT+0800 (China Standard Time)

@zym1010 default settings aren't necessarily chosen with such workloads in mind, so yes, it might have been an issue. ipcs is for System V shared memory, which we aren't using, but I wanted to make sure the same limits don't apply to POSIX shared memory.

It wouldn't definitely disappear, because if the problem is really there, then it's likely a deadlock between the worker and main process, and one worker might be enough to trigger this. Anyway, I can't fix the issue until I can reproduce it. What are the parameters you're using to run the example and did you modify the code in any way? Also, what's the value of torch.__version__? Are you running in docker?

Yimeng Zhang · Answer 13 · Wed Apr 26 2017 20:43:05 GMT+0800 (China Standard Time)

@apaszke Thanks. I understand your analysis much better now.

All other results shown to you up to now were obtained on an Ubuntu 14.04 machine with 64GB RAM, dual Xeon, and Titan Black (there's also a K40, but I didn't use it).

The command to generate the problem is CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 20 --lr 0.01 --workers 22 --batch-size 256 /mnt/temp_drive_3/cv_datasets/ILSVRC2015/Data/CLS-LOC. I didn't modify code at all.

I installed pytorch through pip, on Python 3.5. pytorch version is 0.1.11_5. Not running in Docker.

BTW, I also tried using 1 worker. But I did it on another machine (128GB RAM, dual Xeon, 4 Pascal Titan X, CentOS 6). I ran it using CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 1 --lr 0.01 --workers 1 --batch-size 256 /ssd/cv_datasets/ILSVRC2015/Data/CLS-LOC, and the error log is as follows.

Epoch: [0][5003/5005]   Time 2.463 (2.955)      Data 2.414 (2.903)      Loss 5.9677 (6.6311)    Prec@1 3.516 (0.545)    Prec@5 8.594 (2.262)
Epoch: [0][5004/5005]   Time 1.977 (2.955)      Data 1.303 (2.903)      Loss 5.9529 (6.6310)    Prec@1 1.399 (0.545)    Prec@5 7.692 (2.262)
^CTraceback (most recent call last):
  File "main.py", line 292, in <module>
    main()
  File "main.py", line 137, in main
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 210, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
    idx, batch = self.data_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()

the top showed the following when stuck with 1 worker.

top - 08:34:33 up 15 days, 20:03,  0 users,  load average: 0.37, 0.39, 0.36
Tasks: 894 total,   1 running, 892 sleeping,   0 stopped,   1 zombie
Cpu(s):  7.2%us,  2.8%sy,  0.0%ni, 89.7%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132196824k total, 131461528k used,   735296k free,   347448k buffers
Swap:  2047996k total,    22656k used,  2025340k free, 125226796k cached

Yimeng Zhang · Answer 14 · Wed Apr 26 2017 23:47:15 GMT+0800 (China Standard Time)

Another thing I found is that if I modify the training code so that it doesn't go through all batches, say, it only trains on 50 batches

if i >= 50:
    break

then the deadlock seems to disappear.

Yimeng Zhang · Answer 15 · Thu Apr 27 2017 12:00:40 GMT+0800 (China Standard Time)

Further testing suggests that this freezing happens much more frequently if I run the program just after rebooting the computer. Once there is some cache on the machine, the freezing seems to happen less often.

Adam Paszke · Answer 16 · Thu May 04 2017 06:54:01 GMT+0800 (China Standard Time)

I tried, but I can't reproduce this bug in any way.

Tiancheng Zhi · Answer 17 · Thu May 04 2017 12:06:59 GMT+0800 (China Standard Time)

I ran into a similar issue: the data loader stops when one epoch finishes and a new epoch is about to start.

Tiancheng Zhi · Answer 18 · Thu May 04 2017 12:23:58 GMT+0800 (China Standard Time)

Setting num_workers = 0 works. But the program slows down.

Yimeng Zhang · Answer 19 · Tue May 09 2017 12:48:23 GMT+0800 (China Standard Time)

@apaszke have you tried first rebooting the computer and then running the programs? For me, this guarantees the freezing. I just tried 0.12 version, and it's still the same.

One thing I'd like to point out is that I installed the pytorch using pip, as I have a OpenBLAS-linked numpy installed and the MKL from @soumith 's anaconda cloud wouldn't play with it well.

So essentially pytorch is using MKL and numpy is using OpenBLAS. This may not be ideal, but I think this should have nothing to do with the issue here.

Adam Paszke · Answer 20 · Tue May 09 2017 17:11:31 GMT+0800 (China Standard Time)

I looked into it, but I could never reproduce it. MKL/OpenBLAS should be unrelated to this problem. It's probably some problem with a system configuration

Yimeng Zhang · Answer 21 · Tue May 09 2017 21:37:59 GMT+0800 (China Standard Time)

@apaszke thanks. I just tried the python from anaconda official repo and MKL based pytorch. Still the same problem.

Yimeng Zhang · Answer 22 · Thu May 11 2017 06:06:54 GMT+0800 (China Standard Time)

tried running the code in Docker. Still stuck.

Jussi Sainio · Answer 23 · Wed Jun 07 2017 22:35:33 GMT+0800 (China Standard Time)

We have the same problem, running the pytorch/examples imagenet training example (resnet18, 4 workers) inside an nvidia-docker using 1 GPU out of 4. I'll try to gather a gdb backtrace, if I manage to get to the process.

At least OpenBLAS is known to have a deadlock issue in matrix multiplication, which occurs relatively rarely: OpenMathLib/OpenBLAS#937. This bug was present at least in OpenBLAS packaged in numpy 1.12.0.

Yimeng Zhang · Answer 24 · Wed Jun 07 2017 22:59:31 GMT+0800 (China Standard Time)

@jsainio I also tried pure MKL based PyTorch (numpy is linked with MKL as well), and same problem.

Also, this problem is solved (at least for me) if I turn off pin_memory for the dataloader.

Jussi Sainio · Answer 25 · Fri Jun 09 2017 17:09:21 GMT+0800 (China Standard Time)

It looks as if two of the workers die out.

During normal operation:

root@b06f896d5c1d:~/mnt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user+        1 33.2  4.7 91492324 3098288 ?    Ssl  10:51   1:10 python -m runne
user+       58 76.8  2.3 91079060 1547512 ?    Rl   10:54   1:03 python -m runne
user+       59 76.0  2.2 91006896 1484536 ?    Rl   10:54   1:02 python -m runne
user+       60 76.4  2.3 91099448 1559992 ?    Rl   10:54   1:02 python -m runne
user+       61 79.4  2.2 91008344 1465292 ?    Rl   10:54   1:05 python -m runne

after locking up:

root@b06f896d5c1d:~/mnt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user+        1 24.8  4.4 91509728 2919744 ?    Ssl  14:25  13:01 python -m runne
user+       58 51.7  0.0      0     0 ?        Z    14:27  26:20 [python] <defun
user+       59 52.1  0.0      0     0 ?        Z    14:27  26:34 [python] <defun
user+       60 52.0  2.4 91147008 1604628 ?    Sl   14:27  26:31 python -m runne
user+       61 52.0  2.3 91128424 1532088 ?    Sl   14:27  26:29 python -m runne

For one of the remaining workers, the beginning of the gdb stack trace looks like:

root@b06f896d5c1d:~/mnt# gdb --pid 60
GNU gdb (GDB) 8.0
Attaching to process 60
[New LWP 65]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f36f52af827 in do_futex_wait.constprop ()
   from /lib/x86_64-linux-gnu/libpthread.so.0

(gdb) bt
#0  0x00007f36f52af827 in do_futex_wait.constprop ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f36f52af8d4 in __new_sem_wait_slow.constprop.0 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f36f52af97a in sem_wait@@GLIBC_2.2.5 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f36f157efb1 in semlock_acquire (self=0x7f3656296458,
    args=<optimized out>, kwds=<optimized out>)
    at /home/ilan/minonda/conda-bld/work/Python-3.5.2/Modules/_multiprocessing/semaphore.c:307
#4  0x00007f36f5579621 in PyCFunction_Call (func=
    <built-in method __enter__ of _multiprocessing.SemLock object at remote 0x7f3656296458>, args=(), kwds=<optimized out>) at Objects/methodobject.c:98
#5  0x00007f36f5600bd5 in call_function (oparg=<optimized out>,
    pp_stack=0x7f36c7ffbdb8) at Python/ceval.c:4705
#6  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3236
#7  0x00007f36f5601b49 in _PyEval_EvalCodeWithName (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0,
    closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#8  0x00007f36f5601cd8 in PyEval_EvalCodeEx (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
    defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#9  0x00007f36f5557542 in function_call (
    func=<function at remote 0x7f36561c7d08>,
    arg=(<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7f3656296458>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7f3656296458>, _semlock=<_multiprocessing.SemLock at remote 0x7f3656296458>) at remote 0x7f3656296438>,), kw=0x0)
    at Objects/funcobject.c:627
#10 0x00007f36f5524236 in PyObject_Call (
    func=<function at remote 0x7f36561c7d08>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#11 0x00007f36f554077c in method_call (
    func=<function at remote 0x7f36561c7d08>,
    arg=(<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7f3656296458>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7f3656296458>, _semlock=<_multiprocessing.SemLock at remote 0x7f3656296458>) at remote 0x7f3656296438>,), kw=0x0)
    at Objects/classobject.c:330
#12 0x00007f36f5524236 in PyObject_Call (
    func=<method at remote 0x7f36556f9248>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#13 0x00007f36f55277d9 in PyObject_CallFunctionObjArgs (
    callable=<method at remote 0x7f36556f9248>) at Objects/abstract.c:2445
#14 0x00007f36f55fc3a9 in PyEval_EvalFrameEx (f=<optimized out>,
    throwflag=<optimized out>) at Python/ceval.c:3107
#15 0x00007f36f5601166 in fast_function (nk=<optimized out>, na=1,
    n=<optimized out>, pp_stack=0x7f36c7ffc418,
    func=<function at remote 0x7f36561c78c8>) at Python/ceval.c:4803
#16 call_function (oparg=<optimized out>, pp_stack=0x7f36c7ffc418)
    at Python/ceval.c:4730
#17 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3236
#18 0x00007f36f5601b49 in _PyEval_EvalCodeWithName (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=4, kws=0x7f36f5b85060, kwcount=0, defs=0x0, defcount=0,
    kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#19 0x00007f36f5601cd8 in PyEval_EvalCodeEx (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
    defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#20 0x00007f36f5557661 in function_call (
    func=<function at remote 0x7f36e14170d0>,
    arg=(<ImageFolder(class_to_idx={'n04153751': 783, 'n02051845': 144, 'n03461385': 582, 'n04350905': 834, 'n02105056': 224, 'n02112137': 260, 'n03938244': 721, 'n01739381': 59, 'n01797886': 82, 'n04286575': 818, 'n02113978': 268, 'n03998194': 741, 'n15075141': 999, 'n03594945': 609, 'n04099969': 765, 'n02002724': 128, 'n03131574': 520, 'n07697537': 934, 'n04380533': 846, 'n02114712': 271, 'n01631663': 27, 'n04259630': 808, 'n04326547': 825, 'n02480855': 366, 'n02099429': 206, 'n03590841': 607, 'n02497673': 383, 'n09332890': 975, 'n02643566': 396, 'n03658185': 623, 'n04090263': 764, 'n03404251': 568, 'n03627232': 616, 'n01534433': 13, 'n04476259': 868, 'n03495258': 594, 'n04579145': 901, 'n04266014': 812, 'n01665541': 34, 'n09472597': 980, 'n02095570': 189, 'n02089867': 166, 'n02009229': 131, 'n02094433': 187, 'n04154565': 784, 'n02107312': 237, 'n04372370': 844, 'n02489166': 376, 'n03482405': 588, 'n04040759': 753, 'n01774750': 76, 'n01614925': 22, 'n01855032': 98, 'n03903868': 708, 'n02422699': 352, 'n01560419': 1...(truncated), kw={}) at Objects/funcobject.c:627
#21 0x00007f36f5524236 in PyObject_Call (
    func=<function at remote 0x7f36e14170d0>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#22 0x00007f36f55fe234 in ext_do_call (nk=1444355432, na=0,
    flags=<optimized out>, pp_stack=0x7f36c7ffc768,
    func=<function at remote 0x7f36e14170d0>) at Python/ceval.c:5034
#23 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3275
--snip--

Martin Engilberge · Answer 26 · Fri Jun 09 2017 20:11:19 GMT+0800 (China Standard Time)

I had similar error log, with the main process stuck on: self.data_queue.get()
For me the problem was that I used opencv as image loader. And the cv2.imread function was hanging indefinitely without error on a particular image of imagenet ("n01630670/n01630670_1010.jpeg")

If you said it's working for you with num_workers = 0 it's not that. But I thought it might help some people with similar error trace.

Jussi Sainio · Answer 27 · Fri Jun 09 2017 20:27:23 GMT+0800 (China Standard Time)

I'm running a test with num_workers = 0 currently, no hangs yet. I'm running the example code from https://github.com/pytorch/examples/blob/master/imagenet/main.py. pytorch/vision ImageFolder seems to use PIL or pytorch/accimage internally to load the images, so there's no OpenCV involved.

With num_workers = 4, I can occasionally get the first epoch train and validate fully, and it locks up in the middle of the second epoch. So, it is unlikely a problem in the dataset/loading function.

It looks something like a race condition in ImageLoader which might be triggered relatively rarely by a certain hardware/software combination.

Jussi Sainio · Answer 28 · Fri Jun 09 2017 21:51:50 GMT+0800 (China Standard Time)

@zym1010 thanks for the pointer, I'll try setting pin_memory = False too for the DataLoader.

Jussi Sainio · Answer 29 · Fri Jun 09 2017 22:01:50 GMT+0800 (China Standard Time)

Interesting. On my setup, setting pin_memory = False and num_workers = 4 the imagenet example hangs almost immediately and three of the workers end up as zombie processes:

root@034c4212d022:~/mnt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user+        1  6.7  2.8 92167056 1876612 ?    Ssl  13:50   0:36 python -m runner
user+       38  1.9  0.0      0     0 ?        Z    13:51   0:08 [python] <defunct>
user+       39  4.3  2.3 91069804 1550736 ?    Sl   13:51   0:19 python -m runner
user+       40  2.0  0.0      0     0 ?        Z    13:51   0:09 [python] <defunct>
user+       41  4.1  0.0      0     0 ?        Z    13:51   0:18 [python] <defunct>

Jussi Sainio · Answer 30 · Fri Jun 09 2017 22:17:06 GMT+0800 (China Standard Time)

In my setup, the dataset lies on a networked disk that is read over NFS. With pin_memory = False and num_workers = 4 I can get the system to fail fairly quickly.

=> creating model 'resnet18'
- training epoch 0
Epoch: [0][0/5005]	Time 10.713 (10.713)	Data 4.619 (4.619)	Loss 6.9555 (6.9555)	Prec@1 0.000 (0.000)	Prec@5 0.000 (0.000)
Traceback (most recent call last):
--snip--
    imagenet_pytorch.main.main([data_dir, "--transient_dir", context.transient_dir])
  File "/home/user/mnt/imagenet_pytorch/main.py", line 140, in main
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/home/user/mnt/imagenet_pytorch/main.py", line 168, in train
    for i, (input, target) in enumerate(train_loader):
  File "/home/user/anaconda/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 206, in __next__
    idx, batch = self.data_queue.get()
  File "/home/user/anaconda/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/home/user/anaconda/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/user/anaconda/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/user/anaconda/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

@zym1010 do you happen to have a networked disk or a traditional spinning disk as well which might be slower in latency/etc.?

Yimeng Zhang · Answer 31 · Fri Jun 09 2017 22:20:23 GMT+0800 (China Standard Time)

@jsainio

I'm using a local SSD on the compute node of a cluster. The code is on an NFS drive, but the data is on the local SSD, for maximal loading speed. I never tried loading data from NFS drives.

Jussi Sainio · Answer 32 · Fri Jun 09 2017 22:24:59 GMT+0800 (China Standard Time)

@zym1010 Thanks for the info. I'm running this too on a compute node of a cluster.

Actually, I'm running the num_workers = 0 experiment on the same node at the same time while trying the num_workers = 4 variations. It might be that the first experiment is generating enough load so that possible race conditions manifest themselves faster in the latter.

Jussi Sainio · Answer 33 · Fri Jun 09 2017 22:33:05 GMT+0800 (China Standard Time)

@apaszke When you tried to reproduce this previously, did you happen to try running two instances side-by-side or with some significant other load on the system?

Adam Paszke · Answer 34 · Fri Jun 09 2017 22:36:48 GMT+0800 (China Standard Time)

@jsainio Thanks for investigating this! That's weird, workers should only exit together, and only once the main process is done reading the data. Can you try to inspect why they exit prematurely? Maybe check the kernel log (dmesg)?

Adam Paszke · Answer 35 · Fri Jun 09 2017 22:37:10 GMT+0800 (China Standard Time)

No, I haven't tried that, but it seemed to appear even when that wasn't the case IIRC

Jussi Sainio · Answer 36 · Fri Jun 09 2017 22:41:17 GMT+0800 (China Standard Time)

@apaszke Ok, good to know that the workers should not have exited.

I've tried but I don't know a good way to check why they exit. dmesg does not show anything relevant. (I'm running in a Ubuntu 16.04-derived Docker, using Anaconda packages)

Adam Paszke · Answer 37 · Fri Jun 09 2017 22:50:12 GMT+0800 (China Standard Time)

One way would be to add a number of prints inside the worker loop. I have no idea why they silently exit. It's probably not an exception, because it would have been printed to stderr, so they either break out of the loop, or get killed by the OS (perhaps by a signal?)

Natalia Gimelshein · Answer 38 · Sat Jun 10 2017 00:08:07 GMT+0800 (China Standard Time)

@jsainio, just to make sure, are you running docker with --ipc=host (you don't mention this)? Can you check the size of your shared memory segment (df -h | grep shm)?
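The same check can be done from inside the training script. This is a minimal sketch, assuming /dev/shm is where the shared memory filesystem is mounted (as in the df output in this thread); `shm_usage` and `format_usage` are hypothetical helper names.

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


def format_usage(total, used, free):
    """Render the numbers roughly the way `df -h` presents them."""
    gib = 1024 ** 3
    percent = 100.0 * used / total if total else 0.0
    return "total={:.1f}G used={:.1f}G free={:.1f}G ({:.0f}% used)".format(
        total / gib, used / gib, free / gib, percent
    )


if __name__ == "__main__":
    print(format_usage(*shm_usage()))
```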

Jussi Sainio · Answer 39 · Mon Jun 12 2017 13:12:56 GMT+0800 (China Standard Time)

@ngimel I'm using --shm-size=1024m. df -h | grep shm reports accordingly:

root@db92462e8c19:~/mnt# df -h | grep shm
shm                                                          1.0G  883M  142M  87% /dev/shm

That usage seems rather high, though. This is on a docker with two zombie workers.

Adam Paszke · Answer 40 · Thu Jun 15 2017 02:02:18 GMT+0800 (China Standard Time)

Can you try increasing shm size? I just checked and on the server where I tried to reproduce the problems it was 16GB. You either change the docker flag or run

mount -o remount,size=8G /dev/shm

Adam Paszke · Answer 41 · Thu Jun 15 2017 02:07:13 GMT+0800 (China Standard Time)

I just tried decreasing the size to 512MB, but I got a clear error instead of a deadlock. Still can't reproduce 😕

Natalia Gimelshein · Answer 42 · Thu Jun 15 2017 02:10:39 GMT+0800 (China Standard Time)

With docker we tend to get deadlocks when shm is not enough, rather than clear error messages, don't know why. But it is usually cured by increasing shm (and I did get deadlocks with 1G).

Adam Paszke · Answer 43 · Thu Jun 15 2017 02:12:03 GMT+0800 (China Standard Time)

Ok, it seems that with 10 workers an error is raised, but when I use 4 workers I get a deadlock at 58% of /dev/shm usage! I finally reproduced it

greaber · Answer 44 · Thu Jun 15 2017 02:29:51 GMT+0800 (China Standard Time)

That's great that you can reproduce a form of this problem. I posted a script that triggers a hang in #1579, and you replied that it didn't hang on your system. I had actually only tested it on my MacBook. I just tried on Linux, and it didn't hang. So if you only tried on Linux, it might also be worth trying on a Mac.

Adam Paszke · Answer 45 · Thu Jun 15 2017 07:33:49 GMT+0800 (China Standard Time)

Ok, so after investigating the problem it seems to be a weird issue. Even when I limit /dev/shm to be only 128MB large, Linux is happy to let us create 147MB files there, mmap them fully in memory, but will send a deadly SIGBUS to the worker once it actually tries to access the pages... I can't think of any mechanism that would allow us to check validity of the pages except for iterating over them, and touching each one, with a SIGBUS handler registered...

A workaround for now is to expand /dev/shm with the mount command as I showed above. Try with 16GB (of course, only if you have enough RAM).
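The lazy allocation described above can be observed without crashing anything. The sketch below is an illustration under assumptions, not PyTorch's allocation code: `create_backing_file` is a hypothetical helper that either only sets a file's length (no pages reserved) or calls `os.posix_fallocate`, which asks the filesystem to reserve the space immediately and raises OSError if it cannot. Whether eager reservation would be a practical fix for the loader is not established in this thread.

```python
import os
import tempfile


def create_backing_file(directory, size, reserve=False):
    """Create a temporary `size`-byte file; return (apparent_bytes, allocated_bytes).

    reserve=False: only the length is set, so pages are allocated later,
    when they are first touched.
    reserve=True: posix_fallocate requests the space now and raises
    OSError if the filesystem cannot provide it.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        if reserve:
            os.posix_fallocate(fd, 0, size)
        else:
            os.ftruncate(fd, size)
        info = os.fstat(fd)
        # st_blocks is counted in 512-byte units on Linux.
        return info.st_size, info.st_blocks * 512
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    target = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
    print("lazy :", create_backing_file(target, 1 << 20))
    print("eager:", create_backing_file(target, 1 << 20, reserve=True))
```

In the lazy case the apparent size can exceed what the filesystem is able to back, which matches the observation that creation and mmap succeed while the failure only appears on access.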

Adam Paszke · Answer 46 · Thu Jun 15 2017 07:48:42 GMT+0800 (China Standard Time)

It's hard to find any mentions of this, but here's one.

Clément Pinard · Answer 47 · Thu Jun 15 2017 17:03:27 GMT+0800 (China Standard Time)

Thanks for your time on this issue, it has been driving me nuts for a long time! If I understand correctly, I need to expand /dev/shm to 16G instead of 8G. It makes sense, but when I try df -h, I can see that all my RAM is actually allocated as follows (I have 16G):

tmpfs              7,8G    393M  7,4G   5% /dev/shm
tmpfs              5,0M    4,0K  5,0M   1% /run/lock
tmpfs              7,8G       0  7,8G   0% /sys/fs/cgroup
tmpfs              1,6G     60K  1,6G   1% /run/user/1001

This is the output of df -h during a deadlock. As far as I understand, If I have a SWAP partition of 16G, I can mount tmpfs up to 32G, so it shouldn't be a problem to expand /dev/shm, right ?

More importantly, I am puzzled about the cgroup partition and its purpose, since it takes nearly half of my RAM. Apparently it's designed to manage multi-processor tasks efficiently, but I'm really not familiar with what it does and why we need it. Would it change anything to allocate all of the physical RAM to shm (because we set its size to 16G) and put it in swap (although I believe both would be partly in RAM and partly in swap simultaneously)?

Jussi Sainio · Answer 48 · Thu Jun 15 2017 18:40:07 GMT+0800 (China Standard Time)

@apaszke Thanks! Great that you found the underlying cause. I was occasionally getting both various "ConnectionReset" errors and deadlocks with docker --shm-size=1024m, depending on what other load there was on the machine. Testing now with --shm-size=16384m and 4 workers.

Adam Paszke · Answer 49 · Thu Jun 15 2017 19:20:30 GMT+0800 (China Standard Time)

@jsainio ConnectionReset might have been caused by the same thing. The processes started exchanging some data, but once shm ran out of space a SIGBUS was sent to the worker and killed it.

@ClementPinard as far as I understand you can make it as large as you want, except that it will likely freeze your machine once you run out of RAM (because even kernel can't free this memory). You probably don't need to bother about /sys/fs/cgroup. tmpfs partitions allocate memory lazily, so as long as the usage stays at 0B, it doesn't cost you anything (including limits). I don't think using swap is a good idea, as it will make the data loading muuuuch slower, so you can try increasing the shm size to say 12GB, and limiting the number of workers (as I said, don't use all your RAM for shm!). Here's a nice writeup on tmpfs from the kernel documentation.

I don't know why the deadlock happens even when /dev/shm usage is very small (it happens at 20kB on my machine). Perhaps the kernel is overly optimistic, but doesn't wait until you fill it all, and kills the process once it starts using anything from this region.

Clément Pinard · Answer 50 · Thu Jun 15 2017 19:59:44 GMT+0800 (China Standard Time)

Testing now with 12G and half the workers I had, and it failed :(
It was working like a charm in the lua torch version (same speed, same number of workers), which makes me wonder whether the problem is only /dev/shm related and not closer to python multiprocessing...

The odd thing about it (as you mentionned) is that /dev/shmis never close to be full. During first training epoch, it never went above 500Mo. And It also never locks during first epoch, and if I shut down testing trainloader never fails across all the epochs. The deadlock seems to only appear when beginning test epoch. I should keep track of /dev/shm when going from train to test, maybe there is a peak usage during dataloaders changing.

Yimeng Zhang · Answer 51 · Thu Jun 15 2017 20:38:49 GMT+0800 (China Standard Time)

@ClementPinard even with higher shared memory, and without Docker, it can still fail.

Adam Paszke · Answer 52 · Thu Jun 15 2017 20:40:46 GMT+0800 (China Standard Time)

If torch version == Lua Torch, then it still might be related to /dev/shm. Lua Torch can use threads (there's no GIL), so it doesn't need to go through shared mem (they all share a single address space).

Pratik Chaudhari · Answer 53 · Fri Jul 07 2017 14:55:05 GMT+0800 (China Standard Time)

I had the same issue where the dataloader crashes after complaining that it could not allocate memory at the beginning of a new training or validation epoch. The solutions above did not work for me: (i) my /dev/shm is 32GB and it was never used more than 2.5GB, and (ii) setting pin_memory=False did not work.

This is perhaps something to do with garbage collection? My code looks roughly like the following. I need an infinite iterator and hence I do a try / except around the next() below :-)

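def train():
    train_iter = iter(train_loader)
    for i in xrange(max_batches):  # xrange is Python 2; use range on Python 3
        try:
            x, y = next(train_iter)
        except StopIteration:
            # Loader exhausted: restart it and fetch the first batch of the new pass
            train_iter = iter(train_loader)
            x, y = next(train_iter)
        ...
    del train_iter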

train_loader is a DataLoader object. Without the explicit del train_iter line at the end of the function, the process always crashes after 2-3 epochs (/dev/shm still shows 2.5 GB). Hope this helps!

I am using 4 workers (version 0.1.12_2 with CUDA 8.0 on Ubuntu 16.04).

zhengyun · Answer 54 · Fri Aug 04 2017 08:57:47 GMT+0800 (China Standard Time)

I also met the deadlock, especially when the work_number is large. Is there any possible solution for this problem? My /dev/shm size is 32GB, with cuda 7.5, pytorch 0.1.12 and python 2.7.13. The following is related info after death. It seems related to memory. @apaszke

Yimeng Zhang · Answer 55 · Sat Aug 05 2017 01:22:04 GMT+0800 (China Standard Time)

@zhengyunqq try pin_memory=False if you set it to True. Otherwise, I'm not aware of any solution.

Dan Hendrycks · Answer 56 · Fri Aug 11 2017 23:00:54 GMT+0800 (China Standard Time)

I have also met the deadlock when num_workers is large.

Vadim Kantorov · Answer 57 · Thu Aug 17 2017 05:48:35 GMT+0800 (China Standard Time)

For me, the problem was that if a worker thread dies for whatever reason, then index_queue.put hangs forever. One reason for worker threads dying is the unpickler failing during initialization. In that case, until this Python bugfix in master in May 2017, the worker thread would die and cause the endless hang. In my case, the hang was happening in the batch pre-fetching priming stage.

Maybe SimpleQueue, used in DataLoaderIter, could be replaced by Queue, which allows for a timeout with a graceful exception message.

UPD: I was mistaken, this bugfix patches Queue, not SimpleQueue. It's still true that SimpleQueue will lock if no worker threads are online. An easy way to check that is replacing these lines with self.workers = [].
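
A minimal sketch of the timeout idea, assuming a queue whose get() accepts a timeout (multiprocessing.Queue does, SimpleQueue does not). This is not the DataLoader's actual code; get_or_fail and its parameters are hypothetical names used only for illustration.

```python
import queue


def get_or_fail(data_queue, workers, timeout=5.0):
    """Fetch one item, but raise a clear error instead of hanging forever.

    `data_queue` is any queue whose get() accepts a timeout, for example
    multiprocessing.Queue. `workers` is a list of process-like objects with
    is_alive() and pid. Both names are illustrative, not DataLoader internals.
    """
    while True:
        try:
            return data_queue.get(timeout=timeout)
        except queue.Empty:
            # multiprocessing.Queue raises queue.Empty on timeout as well.
            dead = [w.pid for w in workers if not w.is_alive()]
            if dead or not workers:
                raise RuntimeError(
                    "no batch received within %.1fs; dead workers: %s"
                    % (timeout, dead)
                )
            # All workers are still alive, so keep waiting.
```

With no live workers the call fails after one timeout instead of blocking, which is the behaviour the self.workers = [] experiment above would otherwise turn into an endless hang.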

xfanplus · Answer 58 · Fri Sep 08 2017 13:39:17 GMT+0800 (China Standard Time)

I have the same problem, and I can't change shm (no permission). Maybe it's better to use Queue or something else?

Andreas Doering · Answer 59 · Thu Sep 14 2017 01:13:15 GMT+0800 (China Standard Time)

I have a similar problem.
This code will freeze and never print anything. If I set num_workers=0 it will work though

dataloader = DataLoader(transformed_dataset, batch_size=2, shuffle=True, num_workers=2)
model.cuda()
for i, batch in enumerate(dataloader):
    print(i)

If I put model.cuda() after the loop, everything runs fine.

dataloader = DataLoader(transformed_dataset, batch_size=2, shuffle=True, num_workers=2)

for i, batch in enumerate(dataloader):
    print(i)
model.cuda()

Does anyone have a solution for that problem?

WendyShang · Answer 60 · Thu Sep 21 2017 03:00:39 GMT+0800 (China Standard Time)

I have run into similar issues as well while training ImageNet. It will hang at the 1st iteration of evaluation consistently on certain servers with certain architecture (and not on other servers with the same architecture or the same server with different architecture), but always the 1st iter during eval on validation. When I was using Torch, we found nccl can cause deadlock like this. Is there a way to turn it off?

zoharli · Answer 61 · Mon Oct 23 2017 15:29:01 GMT+0800 (China Standard Time)

I'm facing the same issue, randomly getting stuck at the start of the 1st epoch. All the workarounds mentioned above don't work for me. When Ctrl-C is pressed, it prints these:

Traceback (most recent call last):
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 44, in _worker_loop
    data_queue.put((idx, samples))
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/queues.py", line 354, in put
    self._writer.send_bytes(obj)
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/connection.py", line 398, in _send_bytes
    self._send(buf)
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
KeyboardInterrupt
Traceback (most recent call last):
  File "scripts/train_model.py", line 640, in <module>
    main(args)
  File "scripts/train_model.py", line 193, in main
    train_loop(args, train_loader, val_loader)
  File "scripts/train_model.py", line 341, in train_loop
    ee_optimizer.step()
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/site-packages/torch/optim/adam.py", line 74, in step
    p.data.addcdiv_(-step_size, exp_avg, denom)
KeyboardInterrupt

paulguerrero · Answer 62 · Mon Oct 23 2017 20:48:10 GMT+0800 (China Standard Time)

I had a similar problem of having a deadlock with a single worker inside docker and I can confirm that it was the shared memory issue in my case. By default docker only seems to allocate 64MB of shared memory, however I needed 440MB for 1 worker, which probably caused the behavior described by @apaszke.

Zhao Yilong · Answer 63 · Tue Oct 24 2017 10:34:45 GMT+0800 (China Standard Time)

I am being troubled by the same problem, yet I'm in a different environment from most others in this thread, so maybe my inputs can help locate the underlying cause. My pytorch is installed using the excellent conda package built by peterjc123 under Windows10.

I am running some cnn on the cifar10 dataset. For the dataloaders, num_workers is set to 1. Although having num_workers > 0 is known to cause BrokenPipeError and is advised against in #494, what I am experiencing is not BrokenPipeError but some memory allocation error. The error always occurred at around 50 epochs, right after the validation of the last epoch and before the start of training for the next epoch. 90% of the time it's precisely 50 epochs, other times it will be off by 1 or 2 epochs. Other than that everything else is pretty consistent. Setting num_workers=0 will eliminate this problem.

yjzhux · Answer 64 · Tue Oct 24 2017 10:52:30 GMT+0800 (China Standard Time)

@paulguerrero is right. I solved this problem by increasing the shared memory from 64M to 2G. Maybe it's useful to docker users.

peterjc123 · Answer 65 · Wed Oct 25 2017 14:14:35 GMT+0800 (China Standard Time)

@berzjackson That's a known bug in the conda package. Fixed in the latest CI builds.

Jeremy Howard · Answer 66 · Thu Nov 02 2017 01:27:44 GMT+0800 (China Standard Time)

We have ~600 people that started a new course that uses Pytorch on Monday. A lot of folks on our forum are reporting this problem. Some on AWS P2, some on their own systems (mainly GTX 1070, some Titan X).

When they interrupt training the end of the stack trace shows:

~/anaconda2/envs/fastai/lib/python3.6/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    405 
    406     def _recv_bytes(self, maxsize=None):
--> 407         buf = self._recv(4)
    408         size, = struct.unpack("!i", buf.getvalue())
    409         if maxsize is not None and size > maxsize:

~/anaconda2/envs/fastai/lib/python3.6/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:

We have num_workers=4, pin_memory=False. I've asked them to check their shared memory settings - but is there anything I can do (or we could do in Pytorch) to make this problem go away? (Other than reducing num_workers, since that would slow things down quite a bit.)

apiltamang · Answer 67 · Thu Nov 02 2017 01:52:49 GMT+0800 (China Standard Time)

I'm in the class @jph00 (thanks Jeremy! :) ) referred to. I tried using "num_workers=0" as well. Still get the same error where resnet34 loads very slowly. The fitting is also very slow. But weird thing: this only happens once in the lifetime of a notebook session.

In other words, once the data is loaded, and the fitting is run once, I can move around and keep repeating the steps... even with 4 num_workers, and everything seems to work fast as expected in a GPU.

I'm on PyTorch 0.2.0_4, Python 3.6.2, Torchvision 0.1.9, Ubuntu 16.04 LTS. Doing "df -h" on my terminal says that I've 16GBs on /dev/shm, although the utilization was very low.

Here's a screenshot of where the loading fails (note I've used num_workers=0 for the data)
(sorry about the small letters. I had to zoom out to capture everything...)

Jeremy Howard · Answer 68 · Thu Nov 02 2017 01:57:27 GMT+0800 (China Standard Time)

@apiltamang I'm not sure that's the same issue - it doesn't sound like the same symptoms at all. Best for us to diagnose that on the fast.ai forum, not here.

Soumith Chintala · Answer 69 · Thu Nov 02 2017 02:12:18 GMT+0800 (China Standard Time)

looking into this ASAP!

Jeremy Howard · Answer 70 · Thu Nov 02 2017 02:15:34 GMT+0800 (China Standard Time)

@soumith I've given @apaszke access to the course's private forum and I've asked students with the problem to give us access to login to their box.

Tongzhou Wang · Answer 71 · Thu Nov 02 2017 02:27:41 GMT+0800 (China Standard Time)

@jph00 Hi Jeremy, did any of the students try increasing shm as @apaszke mentioned above? Was that helpful?

Jeremy Howard · Answer 72 · Thu Nov 02 2017 04:49:03 GMT+0800 (China Standard Time)

@ssnl one of the students has confirmed they've increased shared memory, and still have the problem. I've asked some others to confirm too.

Tongzhou Wang · Answer 73 · Thu Nov 02 2017 04:52:02 GMT+0800 (China Standard Time)

@jph00 Thanks! I successfully reproduced the hang due to low shared memory. If the issue lies elsewhere I'll have to dig deeper! Do you mind sharing the script with me?

Jeremy Howard · Answer 74 · Thu Nov 02 2017 04:59:06 GMT+0800 (China Standard Time)

Sure - here's the notebook we're using: https://github.com/fastai/fastai/blob/master/courses/dl1/lesson1.ipynb . The students have noticed that the problem only occurs when they run all the cells in the order they're in the notebook. Hopefully the notebook is self-explanatory, but let me know if you have any trouble running it - it includes a link to download the necessary data.

Based on the shared memory issue you could replicate, is there any kind of workaround I could add to our library or notebook that would avoid it?

Tongzhou Wang · Answer 75 · Thu Nov 02 2017 05:48:06 GMT+0800 (China Standard Time)

@jph00 Diving into the code right now. I'll try to spot ways to reduce shared memory usage. It doesn't seem that the script should use a large amount of shm, so there is hope!

I'll also send out a PR to show a nice error message upon hitting shm limit rather than just letting it hang.

Jeremy Howard · Answer 76 · Thu Nov 02 2017 05:50:05 GMT+0800 (China Standard Time)

OK I've replicated the problem on a fresh AWS P2 instance using their CUDA 9 AMI with latest Pytorch conda install. If you provide your public key, I can give you access to try it out directly. My email is the first letter of my first name at fast.ai

Tongzhou Wang · Answer 77 · Thu Nov 02 2017 05:53:12 GMT+0800 (China Standard Time)

@jph00 Just sent you an email :) thanks!

Tongzhou Wang · Answer 78 · Thu Nov 02 2017 06:25:06 GMT+0800 (China Standard Time)

@jph00 And FYI, the script took 400MB shared memory on my box. So it'd be great for students who had this issue to check they have enough free shm.
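
One way to do that check from Python, as a rough sketch: it only uses the standard library, assumes shared memory is mounted at /dev/shm (true on typical Linux setups, not on Windows or macOS), and shm_free_bytes / check_shm are hypothetical helper names.

```python
import os
import shutil


def shm_free_bytes(path="/dev/shm"):
    """Return free bytes on the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None  # e.g. Windows or macOS, where /dev/shm does not exist
    return shutil.disk_usage(path).free


def check_shm(required_bytes, path="/dev/shm"):
    """Return True if at least `required_bytes` are free, or if we cannot tell."""
    free = shm_free_bytes(path)
    if free is None:
        return True  # nothing to check on this platform
    return free >= required_bytes
```

For the script discussed here, something like check_shm(400 * 1024 ** 2) before creating the loaders would flag a box with less than the ~400MB mentioned above. Note that free space alone does not rule out the hang: several reports in this thread had plenty of free shm.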

Jeremy Howard · Answer 79 · Thu Nov 02 2017 09:32:36 GMT+0800 (China Standard Time)

OK so I've figured out the basic issue, which is that opencv and Pytorch multiprocessing don't play well together, sometimes. No problems on our box at university, but lots of problems on AWS (on the new deep learning CUDA 9 AMI with P2 instance). Adding locking around all cv2 calls doesn't fix it, and adding cv2.setNumThreads(0) doesn't fix it either. This seems to fix it:

from multiprocessing import set_start_method
set_start_method('spawn')

However that impacts performance by about 15%. The recommendation in the opencv github issue is to use https://github.com/tomMoral/loky . I've used that module before and found it rock-solid. Not urgent, since we've got a solution that works well enough for now - but might be worth considering using Loky for Dataloader?

Perhaps more importantly, it would be nice if at least there was some kind of timeout in pytorch's queue so that these infinite hangs would get caught.
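
The set_start_method('spawn') call above changes the start method for the whole program. As a hedged sketch using only the standard library (not anything PyTorch-specific), a context object can scope 'spawn' to one pool instead; square and run_with_spawn are hypothetical names.

```python
import multiprocessing


def square(x):
    # Must be a module-level function so spawned children can import it.
    return x * x


def run_with_spawn(items, processes=2):
    """Map `square` over `items` in worker processes started with 'spawn'."""
    # get_context() returns a context object and leaves the global default alone.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=processes) as pool:
        return pool.map(square, items)
```

With 'spawn', children re-import the main module, so the call to run_with_spawn (or to set_start_method) should sit under an if __name__ == "__main__": guard in a script file.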

Jeremy Howard · Answer 80 · Thu Nov 02 2017 11:04:48 GMT+0800 (China Standard Time)

FYI, I just tried a different fix, since 'spawn' was making some parts 2-3x slower - which is that I added a few random sleeps in sections that iterate through the dataloader quickly. That also fixed the problem - although perhaps not ideal!

Tongzhou Wang · Answer 81 · Thu Nov 02 2017 11:53:39 GMT+0800 (China Standard Time)

Thanks for digging into this! Glad to know that you've found two workarounds. Indeed it would be good to add timeouts on indexing into datasets. We will discuss and get back to you on that route tomorrow.

cc @soumith is loky something we want to investigate?

Tongzhou Wang · Answer 82 · Thu Nov 02 2017 11:54:25 GMT+0800 (China Standard Time)

For people who come to this thread for above discussion, the opencv issue is discussed in greater depth at opencv/opencv#5150

Jeremy Howard · Answer 83 · Thu Nov 02 2017 22:34:22 GMT+0800 (China Standard Time)

OK I seem to have a proper fix for this now - I've rewritten Dataloader to use ProcessPoolExecutor.map() and moved the creation of the tensor into the parent process. The result is faster than I was seeing with the original Dataloader, and it's been stable on all the computers I've tried it on. The code is also a lot simpler.

If anyone is interested in using it, you can get it from https://github.com/fastai/fastai/blob/master/fastai/dataloader.py .

The API is the same as the standard version, except that your Dataset must not return a Pytorch tensor - it should return numpy arrays or python lists. I haven't made any attempt to make it work on older Pythons, so I wouldn't be surprised if there's some issues there.
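
To illustrate the shape of that design (this is not the fastai code), here is a small sketch: workers fetch samples through an executor's map(), the dataset returns plain Python lists, and collation happens in the parent. It uses ThreadPoolExecutor rather than ProcessPoolExecutor so it runs anywhere without pickling concerns; both expose the same map() interface. SquaresDataset and iter_batches are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


class SquaresDataset:
    """Toy dataset returning plain Python lists, not tensors."""

    def __len__(self):
        return 10

    def __getitem__(self, i):
        return [i, i * i]


def iter_batches(dataset, batch_size, num_workers=2):
    """Yield batches; samples are fetched by workers, collated in the parent."""
    indices = range(len(dataset))
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for start in range(0, len(dataset), batch_size):
            chunk = indices[start:start + batch_size]
            # map() preserves input order, so batch contents are deterministic.
            samples = list(pool.map(dataset.__getitem__, chunk))
            # Collation (and any tensor creation) would happen here, in the parent.
            yield [list(column) for column in zip(*samples)]
```

Iterating over iter_batches(SquaresDataset(), 4) gives three batches, the last one shorter. Because only lists cross the worker boundary, nothing has to go through shared memory.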

(The reason I've gone down this path is that I found when doing a lot of image processing/augmentation on recent GPUs that I couldn't complete the processing fast enough to keep the GPU busy, if I did the preprocessing using Pytorch CPU operations; however using opencv was much faster, and I was able to fully utilize the GPU as a result.)

Adam Paszke · Answer 84 · Thu Nov 02 2017 23:09:53 GMT+0800 (China Standard Time)

Oh if it's an opencv issue then there's not a lot we can do about it. It's true that forking is dangerous when you have thread pools. I don't think we want to add a runtime dependency (currently we have none), especially since it won't handle PyTorch tensors nicely. It would be better to just figure out what's causing the deadlocks and @ssnl is on it.

@jph00 have you tried Pillow-SIMD? It should work with torchvision out of the box and I have heard many good things about it.

Jeremy Howard · Answer 85 · Thu Nov 02 2017 23:46:30 GMT+0800 (China Standard Time)

Yes I know pillow-SIMD well. It only speeds up resize, blur, and RGB conversion.

I don't agree there's not a lot you can do here. It's not exactly an opencv issue (they don't claim to support this type of python multiprocessing more generally, let alone pytorch's special-cased multi-processing module) and not exactly a Pytorch issue either. But the fact that Pytorch silently waits for ever without giving any kind of error is (IMO) something you can fix, and more generally a lot of smart folks have been working hard over the last few years to create improved multiprocessing approaches which avoid problems just like this one. You could borrow from the approaches they use without bringing in an external dependency.

Olivier Grisel, who is one of the folks behind Loky, has a great slide deck summarizing the state of multiprocessing in Python: http://ogrisel.github.io/decks/2017_euroscipy_parallelism/

I don't mind either way, since I've now written a new Dataloader that doesn't have the problem. But I do, FWIW, suspect that interactions between pytorch's multiprocessing and other systems will be an issue for other folks too in the future.

Kimmy · Answer 86 · Fri Nov 17 2017 03:32:35 GMT+0800 (China Standard Time)

For what it's worth, I had this issue on Python 2.7 on ubuntu 14.04. My data loader read from a sqlite database and worked perfectly with num_workers=0, sometimes seemed OK with num_workers=1, and very quickly deadlocked for any higher value. Stack traces showed the process hung in recv_bytes.

Things that didn't work:

Passing --shm-size 8G or --ipc=host when launching docker
Running echo 16834 | sudo tee /proc/sys/kernel/shmmni to increase the number of shared memory segments (the default was 4096 on my machine)
Setting pin_memory=True or pin_memory=False, neither one helped

The thing that reliably fixed my issue was porting my code to Python 3. Launching the same version of Torch inside a Python 3.6 instance (from Anaconda) completely fixed my issue and now data loading doesn't hang anymore.

Jeremy Howard · Answer 87 · Sat Nov 18 2017 13:48:10 GMT+0800 (China Standard Time)

@apaszke here's why working well with opencv is important, FYI (and why torchsample isn't a great option - it can handle rotation of <200 images/sec!):

Umar Iqbal · Answer 88 · Sat Dec 09 2017 09:05:12 GMT+0800 (China Standard Time)

Did anyone find a solution to this problem?

elbaro · Answer 89 · Thu Dec 14 2017 16:43:31 GMT+0800 (China Standard Time)

@Iqbalu Try the script above: https://github.com/fastai/fastai/blob/master/fastai/dataloader.py
It solved my issue but it doesn't support num_workers=0.

Umar Iqbal · Answer 90 · Thu Dec 14 2017 17:12:40 GMT+0800 (China Standard Time)

@elbaro actually I tried it and in my case it was not using multiple workers at all. Did you change anything there?

Adam Paszke · Answer 91 · Thu Dec 14 2017 19:09:33 GMT+0800 (China Standard Time)

@Iqbalu fast.ai data loader never spawns worker processes. It only uses threads, so they might not show up in some tools

Umar Iqbal · Answer 92 · Fri Dec 15 2017 12:00:52 GMT+0800 (China Standard Time)

@apaszke @elbaro @jph00 The data loader from fast.ai slowed down data reading by more than 10x. I am using num_workers=8. Any hint what could be the reason?

Adam Paszke · Answer 93 · Fri Dec 15 2017 15:45:08 GMT+0800 (China Standard Time)

It's likely data loader uses packages that don't give up the GIL

Umar Iqbal · Answer 94 · Thu Dec 28 2017 12:53:37 GMT+0800 (China Standard Time)

@apaszke any idea why the usage of shared memory keeps increasing after some epochs? In my case, it starts with 400MB and then every ~20th epoch increases by 400MB. Thanks!

Adam Paszke · Answer 95 · Thu Dec 28 2017 23:16:24 GMT+0800 (China Standard Time)

@Iqbalu not really. That shouldn't be happening

Remi · Answer 96 · Fri Jan 19 2018 08:40:22 GMT+0800 (China Standard Time)

I tried many things and cv2.setNumThreads(0) finally solved my issue.

Thanks @jph00

Roy · Answer 97 · Thu Jan 25 2018 13:58:48 GMT+0800 (China Standard Time)

I have been troubled by this problem recently. cv2.setNumThreads(0) doesn't work for me. I even changed all cv2 code to use scikit-image instead, but the problem still exists. Besides, I have 16G for /dev/shm. I only have this problem when using multiple GPUs. Everything works fine on a single GPU. Does anyone have any new thoughts on the solution?

Jack · Answer 98 · Sat Jan 27 2018 16:48:41 GMT+0800 (China Standard Time)

Same Error. I have this problem when using single gpu.

shacharf · Answer 99 · Sun Jan 28 2018 17:57:10 GMT+0800 (China Standard Time)

For me disabling opencv threads solved the problem:
cv2.setNumThreads(0)
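
A small sketch of where that call might go, assuming OpenCV may or may not be installed; disable_opencv_threads is a hypothetical helper name. As other replies in this thread show, this does not help in every setup.

```python
try:
    import cv2  # optional: only needed if the dataset uses OpenCV
except ImportError:
    cv2 = None


def disable_opencv_threads(worker_id=None):
    """Turn off OpenCV's internal thread pool in the current process.

    Returns True if OpenCV was found and configured, False otherwise.
    The unused `worker_id` argument lets the function double as a
    per-worker init callback (an assumption about how you wire it up).
    """
    if cv2 is None:
        return False
    cv2.setNumThreads(0)
    return True


# Call once near the top of the training script, before creating any loader.
disable_opencv_threads()
```

If your PyTorch version's DataLoader accepts a worker_init_fn argument, passing disable_opencv_threads there would also apply the setting inside each worker process.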

tianq01 · Answer 100 · Thu Feb 01 2018 10:35:50 GMT+0800 (China Standard Time)

hit it too with pytorch 0.3, cuda 8.0, ubuntu 16.04
no opencv used.