microsoft / WSL

Home Page: https://docs.microsoft.com/windows/wsl

CUDA on WSL hangs after ~1h training

FremyCompany opened this issue

Windows Build Number

Microsoft Windows [Version 10.0.22458.1000]

WSL Version

  • WSL 2

Kernel Version

5.4.91

Distro Version

Ubuntu 20.04

Other Software

No response

Repro Steps

While training DNN models with an NVIDIA GPU via CUDA on WSL2, the training eventually comes to a stop and hangs. This does not result in a crash; the training is just stuck indefinitely.
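
When it happens, the hang is easy to see from a second terminal; a minimal sketch of what it looks like (assuming the training script is a Python process and nvidia-smi is available inside the distro):

  # GPU utilization drops to 0% while the training process still exists:
  $ watch -n 5 nvidia-smi
  # The process stays listed but makes no further progress:
  $ ps aux | grep python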

Expected Behavior

Running CUDA code in WSL2 should be stable.

Actual Behavior

Running CUDA code in WSL2 results in a hang of the CUDA application.

Diagnostic Logs

I have the issue myself, and noticed that others have faced the same issue recently, as evidenced by the following thread on the NVIDIA forums:
https://forums.developer.nvidia.com/t/training-wsl-2-cuda-hangs-over-several-training-steps/176225/6

Windows Build Number

Edition: Windows 11 Pro for Workstations Insider Preview
Version: Dev
Installed on: 25.9.2021
OS build: 22463.1000
Experience: Windows Feature Experience Pack 1000.22463.1000.0

Kernel Version

5.10.43.3-microsoft-standard-WSL2

Distro Version

Ubuntu 20.04

Repro Steps

I have the same problem on Windows 11; it started before the last update, and the last update gave some hope, but it is still not working. Training just freezes: GPU usage drops to 0%, but VRAM stays at 90%, RAM usage stays high, and CPU usage drops to ~15%.
The freeze happens randomly, not only while training is running.
I tried cuDNN on CPU only and the same thing happened.
When it freezes, the resources are not freed until a PC restart or wsl --shutdown. I tried:

  • killing processes with
    $ kill -9 -1
    (does not work)
  • closing the terminal
    (does not work)
  • opening a new terminal and killing the processes
    (does not work)
  • wsl --shutdown
    (works; see the recovery sketch below)
    After wsl --shutdown I can run the instance again, but it freezes again within ~1-2 hours of use, sometimes sooner.
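
For reference, a minimal sketch of that recovery sequence, driven from the Windows side (using PowerShell rather than cmd is an assumption; both should work):

  # From a Windows PowerShell prompt, not from inside the frozen distro:
  > wsl --shutdown    # tears down the whole WSL 2 VM, freeing VRAM and RAM
  > wsl               # start a fresh instance of the default distro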

Could you try updating to the latest kernel version and check whether you still see this issue? We believe that 5.10.60.1 has a fix that might resolve this.

Please run wsl --update to update, and then you can verify your kernel version by running uname -a inside of a Linux instance.
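
Concretely, the sequence looks like the sketch below (restarting the VM with wsl --shutdown so the new kernel gets picked up is an assumption based on how WSL kernel updates normally apply):

  # From a Windows prompt:
  > wsl --update      # pull the latest WSL 2 kernel
  > wsl --shutdown    # restart the VM so the updated kernel loads
  # Then, inside the Linux distro:
  $ uname -a          # should now report 5.10.60.1 or later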

Though it is difficult to be 100% sure in my case (given the long MTBF), updating the kernel does appear to have fixed this issue for me.
Thanks for the hint :)

Great! Well it seems like this is the likely fix here. I'll close this issue, and we can reopen it if this problem comes up again. Thank you for filing this!

I'm experiencing the hanging issue described above as well, a year later: CUDA 11.7, WSL2, Ubuntu 20.04. I tried wsl --update. Small models are fine, but for larger ones I'm guessing it runs out of memory in an ungraceful way. The same larger model works fine on a Linux box with the same NVIDIA card.
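
One hedged way to test the out-of-memory guess (assuming nvidia-smi works inside the distro) is to log GPU memory alongside the training run and see whether the hang coincides with memory.used approaching memory.total:

  # In a second terminal, sample GPU memory every 5 seconds:
  $ nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 5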

The update fixed the same issue for me, thank you very much. However, I'm worried it will happen again when training larger models. This link describes some limitations CUDA has under WSL when training models:
https://docs.nvidia.com/cuda/wsl-user-guide/index.html

Same here: Win11 + WSL2 Ubuntu, training resnet18 on ImageNet, and it randomly freezes and cannot accept new inputs.

Same issue here.
WSL2 stops responding after ~40 minutes of training, and Windows never went to sleep.
I'm not even sure how to debug it.
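
For anyone stuck at the same point, one generic starting point (a sketch, not an official WSL triage flow) is to watch the kernel log from a second terminal while reproducing the hang:

  # Follow kernel messages live; errors logged at the moment of the freeze
  # (e.g. from the GPU paravirtualization layer) help narrow things down.
  $ sudo dmesg --follow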