Heterogeneous atomic does not seem to work on WSL2 with a GeForce GPU.

Question

SaltyChiang opened this issue a year ago

I want to run QUDA on my GeForce RTX 3080 12GB, and I have successfully built invert_test from the latest commit on Debian testing (under WSL2, which is effectively a virtual machine on Windows). The test appears to hang after printing cublasCreated successfully.

I found that the HETEROGENEOUS_ATOMIC macro enables subroutines in include/targets/cuda/reduce_helper, and the while loop at line 131 never finishes. When I set -DQUDA_HETEROGENEOUS_ATOMIC=OFF, the test ran to completion with good performance.

This should not be a Windows or WSL2 problem, because the test works fine on a Tesla P100 in the same environment.

Is this a limitation of GeForce GPUs, or did I do something wrong?

$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux trixie/sid
Release:        n/a
Codename:       trixie

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

$ cuda-g++ --version
cuda-g++ (Debian 12.3.0-8) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Evan Weinberg · Answer 1 · Wed Sep 06 2023 02:43:38 GMT+0800 (China Standard Time)

I think I've hit this in the past (in WSL) and lost track of actually reporting it. I believe it's downstream of some constraints with WSL: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-applications

If it's easy for you (I've since broken my WSL setup...), can you modify lib/targets/cuda/device.cpp around line 97 and have it print the value of the field deviceProp.hostNativeAtomicSupported? My hunch is that it will return 0 in WSL (and 1 in proper Linux environments). If that's the case, we should add a runtime check for QUDA_HETEROGENEOUS_ATOMICS && deviceProp.hostNativeAtomicSupported == 1.

SaltyChiang · Answer 2 · Wed Sep 06 2023 12:21:35 GMT+0800 (China Standard Time)

Thank you for your reply! I didn't realize that the atomic feature is related to UM.

You are right, deviceProp.hostNativeAtomicSupported is 0 on my machine. But it's still strange that the Tesla P100 works fine on WSL2. I will check the property on the P100 later.

Evan Weinberg · Answer 3 · Thu Sep 07 2023 01:39:37 GMT+0800 (China Standard Time)

Oh, that's strange. I misunderstood your original post: I didn't realize it was working in WSL2 with the Tesla P100, so I'm a bit confused there. Ah well! I'm still interested in seeing what deviceProp.hostNativeAtomicSupported returns on your WSL2 + Tesla P100 configuration. If it returns 1 there (for a reason I can't currently conceive of), we at least still have a robust runtime check, even if why it works on Tesla but not GeForce remains a bit of a mystery.

SaltyChiang · Answer 4 · Sun Sep 10 2023 15:58:55 GMT+0800 (China Standard Time)

The problem is more complicated than I thought. I tried calling qudaDeviceSynchronize after TunableKernel::launch_device<Functor, grid_stride>(KERNEL(Reduction2D), tp, stream, arg) in include/tunable_reduction.h, and the test passed with good performance compared with disabling heterogeneous atomic. It seems that the while loop from the original post starts too early.

deviceProp.hostNativeAtomicSupported also returns 0 on the Tesla P100, so this issue might not be due to the atomic feature.

maddyscientist · Answer 5 · Wed Mar 13 2024 03:17:14 GMT+0800 (China Standard Time)

Closing this out. For a variety of reasons, Windows does not support cuda::std::atomic when running on the Pascal architecture; Volta or newer is required. As @SaltyChiang has found, when running with Pascal on Windows, one needs to set QUDA_HETEROGENEOUS_ATOMICS=OFF when compiling QUDA.