CUDA DeviceAllocate segfault
drzraf opened this issue
#0 0x00007bc0622c6554 in std::_Rb_tree_increment(std::_Rb_tree_node_base const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#1 0x00007bc05573e59a in cub::CachingDeviceAllocator::DeviceAllocate(int, void**, unsigned long, CUstream_st*) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#2 0x00007bc05573ea99 in ctranslate2::cuda::CubCachingAllocator::allocate(unsigned long, int) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#3 0x00007bc055712796 in ctranslate2::StorageView::reserve(long) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#4 0x00007bc0557127f8 in ctranslate2::StorageView::resize(std::vector<long, std::allocator<long> >) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#5 0x00007bc0556f59f2 in void ctranslate2::ops::MatMul::compute<(ctranslate2::Device)1, float>(ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView&) const ()
from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#6 0x00007bc055660d24 in ctranslate2::layers::dot_product_attention(ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView const*, ctranslate2::StorageView const*, ctranslate2::StorageView const*, ctranslate2::StorageView const*, long, ctranslate2::StorageView&, ctranslate2::StorageView*, bool, float, bool, bool, long, ctranslate2::layers::Alibi*, ctranslate2::StorageView*) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#7 0x00007bc05566208d in ctranslate2::layers::MultiHeadAttention::operator()(ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView const*, ctranslate2::StorageView&, ctranslate2::StorageView*, ctranslate2::StorageView*, ctranslate2::StorageView*, ctranslate2::Padder const*, ctranslate2::Padder const*, bool, ctranslate2::StorageView*, long) const ()
from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
CT2_VERBOSE=3 LD_LIBRARY_PATH=/home/.local/lib/python3.10/site-packages/ctranslate2.libs whisper-ctranslate2 --language=en --verbose=true --model small -f srt --output_dir /tmp/ foo.mp4
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce 940MX Off | 00000000:01:00.0 Off | N/A |
| N/A 50C P8 N/A / 200W | 1988MiB / 2048MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
The `small` model (sadly) doesn't hold within my 2GB GPU, but it causes a segfault instead of failing properly.

- happens with both the wheel and a hand-compiled `.so`
- the `tiny` model works (no OOM)
- Important and unexpectedly useful workaround: setting `CT2_CUDA_ALLOW_BF16=1 CT2_CUDA_ALLOW_FP16=1`, I could get `small` to run successfully on this GPU (!) (a rough Python equivalent is sketched below)
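For reference, a rough Python equivalent of that workaround, assuming the `faster-whisper` API that whisper-ctranslate2 wraps (the model name, file path and explicit `float16` choice are placeholders taken from the command above):

```python
import os

# Assumption: the CT2_CUDA_ALLOW_* overrides must be in the environment before
# the model is loaded, matching the CLI invocation above.
os.environ["CT2_CUDA_ALLOW_FP16"] = "1"
os.environ["CT2_CUDA_ALLOW_BF16"] = "1"

from faster_whisper import WhisperModel

# float16 halves the weight memory compared with float32, which is presumably
# what lets "small" fit into the 2 GiB card.
model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe("foo.mp4", language="en")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```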
Hello, do you use quantization for the small model? Which compute type do you use? It seems like this is just an OOM problem because you don't have enough VRAM. `nvidia-smi` only shows you the memory used up to the moment the program crashes; when the program tries to allocate more, it exceeds 2GB.
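A single `nvidia-smi` snapshot indeed misses the allocation peak. A small polling sketch, assuming the `nvidia-ml-py` (`pynvml`) bindings are installed, that can be left running in a second terminal while transcribing:

```python
import time

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)  # GPU 0, the GeForce 940MX above

# Print VRAM usage once per second; the last line before the crash approximates
# the allocation peak that a single nvidia-smi call never shows.
while True:
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"used {info.used / 2**20:.0f} MiB / {info.total / 2**20:.0f} MiB")
    time.sleep(1)
```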
- Tested all of them (with `small`) without `CT2_*` env and got `ValueError: Requested XXX compute type, but the target device or backend do not support efficient XXX computation.` except for `float32`, which triggers a segfault (a sketch for querying the supported types follows this list).
- `float32` always segfaults
- Setting `CT2_CUDA_ALLOW_FP16=1`, it only works for `float16` (others trigger `ValueError`)
- Setting `CT2_CUDA_ALLOW_BF16=1`, then `bfloat16` gives `RuntimeError: cuDNN failed with status CUDNN_STATUS_ARCH_MISMATCH` (others trigger `ValueError`)
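For completeness, the compute types CTranslate2 considers efficiently supported can be queried directly; a minimal sketch, assuming the stock `ctranslate2` Python package and that any `CT2_CUDA_ALLOW_*` variables are set before the query:

```python
import ctranslate2

# Lists the compute types reported as efficiently supported on the CUDA device.
# On a card without FP16 tensor cores this typically excludes float16/bfloat16,
# which would match the ValueError messages above.
print(ctranslate2.get_supported_compute_types("cuda"))
```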
`auto` and `default` select `float32`:
[2024-05-27 08:57:18.106] [ctranslate2] [thread 3417167] [info] - Allow INT8: false
[2024-05-27 08:57:18.106] [ctranslate2] [thread 3417167] [info] - Allow FP16: false (with Tensor Cores: false)
[2024-05-27 08:57:18.106] [ctranslate2] [thread 3417167] [info] - Allow BF16: false
[2024-05-27 08:57:19.253] [ctranslate2] [thread 3417167] [info] Using CUDA allocator: cub_caching
[2024-05-27 08:57:19.995] [ctranslate2] [thread 3417167] [info] - Binary version: 6
[2024-05-27 08:57:19.995] [ctranslate2] [thread 3417167] [info] - Model specification revision: 3
[2024-05-27 08:57:19.995] [ctranslate2] [thread 3417167] [info] - Selected compute type: float32
`medium` segfaults even with `CT2_CUDA_ALLOW_FP16=1`
Try int8 or float16 quantization. Your GPU is too small to work with the medium model in float32; that's normal. `bfloat16` only works with GPUs of compute capability 8.x or newer (your GPU could be only 7.x).
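A minimal sketch of that suggestion through faster-whisper (model name and audio path are placeholders; note the log above reports "Allow INT8: false" on this card, so int8 may still be rejected there and float16 with `CT2_CUDA_ALLOW_FP16=1` is the likelier option):

```python
from faster_whisper import WhisperModel

# Quantized weights: int8 is roughly 4x smaller and float16 roughly 2x smaller
# than float32, which is what makes the larger models plausible on a 2 GB GPU.
model = WhisperModel("small", device="cuda", compute_type="int8")
segments, _ = model.transcribe("foo.mp4", language="en")
print(" ".join(segment.text for segment in segments))
```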