Is there a way to avoid the thread lock in cuda driver?
coreylowman opened this issue
Okay, after too much time investigating, it seems the CUDA driver (not NCCL) is using a global mutex, which makes multithreaded multi-GPU code quite useless.
https://forums.developer.nvidia.com/t/cuda-wont-concurrently-run-kernels-on-multiple-devices-from-within-same-process/240388
https://forums.developer.nvidia.com/t/multithreaded-tensorrt-performance-drops-dramatically/184882/8
https://forums.developer.nvidia.com/t/cuda-introduces-heavy-locks/61357

Basically, all threads contend for that lock, so the CPU cannot launch kernels fast enough.
Is there any way to avoid that mutex? It would help, since multithreaded code is significantly simpler than multiprocess code.
But the tone in the related issues makes me think it's a known issue and nothing is going to be done about it.

Edit: I updated the code and internals to reflect that, hopefully saving future devs from being bitten in the same way.
Originally posted by @Narsil in #164 (comment)
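
For anyone landing here, the multiprocess alternative mentioned above can be sketched roughly as follows. This is a minimal sketch, not code from this repository: it assumes the lock is per-process (as the linked forum threads suggest), so one worker process per GPU would not contend on it. The `worker_command` helper, the `"worker"` argument, and the GPU count of 2 are all hypothetical; `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for restricting which devices a process sees.

```rust
use std::env;
use std::path::Path;
use std::process::Command;

/// Hypothetical helper: build a command that re-runs `exe` as a worker
/// that can only see GPU `gpu` (it appears as device 0 inside the worker).
fn worker_command(exe: &Path, gpu: usize) -> Command {
    let mut cmd = Command::new(exe);
    cmd.arg("worker")
        .env("CUDA_VISIBLE_DEVICES", gpu.to_string());
    cmd
}

fn main() -> std::io::Result<()> {
    // Worker branch: CUDA initialization and kernel launches would go here.
    if env::args().nth(1).as_deref() == Some("worker") {
        let gpu = env::var("CUDA_VISIBLE_DEVICES").unwrap_or_default();
        println!("worker sees CUDA_VISIBLE_DEVICES={gpu}");
        return Ok(());
    }

    // Parent branch: spawn one process per GPU, then wait for all of them.
    let exe = env::current_exe()?;
    let num_gpus = 2; // assumption: adjust to the actual device count
    let mut children = Vec::new();
    for gpu in 0..num_gpus {
        children.push(worker_command(&exe, gpu).spawn()?);
    }
    for mut child in children {
        child.wait()?;
    }
    Ok(())
}
```

The cost is the one raised in the comment: any data shared between GPUs now has to cross a process boundary instead of living in one address space.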