CUDA DeviceAllocate segfault
drzraf opened this issue
#0 0x00007bc0622c6554 in std::_Rb_tree_increment(std::_Rb_tree_node_base const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#1 0x00007bc05573e59a in cub::CachingDeviceAllocator::DeviceAllocate(int, void**, unsigned long, CUstream_st*) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#2 0x00007bc05573ea99 in ctranslate2::cuda::CubCachingAllocator::allocate(unsigned long, int) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#3 0x00007bc055712796 in ctranslate2::StorageView::reserve(long) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#4 0x00007bc0557127f8 in ctranslate2::StorageView::resize(std::vector<long, std::allocator<long> >) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#5 0x00007bc0556f59f2 in void ctranslate2::ops::MatMul::compute<(ctranslate2::Device)1, float>(ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView&) const ()
from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#6 0x00007bc055660d24 in ctranslate2::layers::dot_product_attention(ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView const*, ctranslate2::StorageView const*, ctranslate2::StorageView const*, ctranslate2::StorageView const*, long, ctranslate2::StorageView&, ctranslate2::StorageView*, bool, float, bool, bool, long, ctranslate2::layers::Alibi*, ctranslate2::StorageView*) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#7 0x00007bc05566208d in ctranslate2::layers::MultiHeadAttention::operator()(ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView const*, ctranslate2::StorageView&, ctranslate2::StorageView*, ctranslate2::StorageView*, ctranslate2::StorageView*, ctranslate2::Padder const*, ctranslate2::Padder const*, bool, ctranslate2::StorageView*, long) const ()
from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
CT2_VERBOSE=3 LD_LIBRARY_PATH=/home/.local/lib/python3.10/site-packages/ctranslate2.libs whisper-ctranslate2 --language=en --verbose=true --model small -f srt --output_dir /tmp/ foo.mp4
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce 940MX Off | 00000000:01:00.0 Off | N/A |
| N/A 50C P8 N/A / 200W | 1988MiB / 2048MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
The `small` model (sadly) doesn't hold within my 2GB GPU, but it causes a segfault instead of failing properly.

- happens with both the wheel and a hand-compiled `.so`
- the `tiny` model works (no OOM)
- Important and unexpectedly useful workaround: setting `CT2_CUDA_ALLOW_BF16=1 CT2_CUDA_ALLOW_FP16=1`, I could get `small` to run successfully on this GPU (!) (a rough Python equivalent is sketched below)
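For reference, a rough Python equivalent of that workaround, assuming the `faster-whisper` API that whisper-ctranslate2 wraps (the model name, file path and explicit `float16` choice are placeholders taken from the command above):

```python
import os

# Assumption: the CT2_CUDA_ALLOW_* overrides must be in the environment before
# the model is loaded, matching the CLI invocation above.
os.environ["CT2_CUDA_ALLOW_FP16"] = "1"
os.environ["CT2_CUDA_ALLOW_BF16"] = "1"

from faster_whisper import WhisperModel

# float16 halves the weight memory compared with float32, which is presumably
# what lets "small" fit into the 2 GiB card.
model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe("foo.mp4", language="en")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```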
Hello, do you use quantization for the small model? Which compute type do you use? It seems like this is just an OOM problem because you don't have enough VRAM. `nvidia-smi` only shows you the memory used up to the moment the program crashes; when the program tries to allocate more, it exceeds 2GB.
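A single `nvidia-smi` snapshot indeed misses the allocation peak. A small polling sketch, assuming the `nvidia-ml-py` (`pynvml`) bindings are installed, that can be left running in a second terminal while transcribing:

```python
import time

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)  # GPU 0, the GeForce 940MX above

# Print VRAM usage once per second; the last line before the crash approximates
# the allocation peak that a single nvidia-smi call never shows.
while True:
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"used {info.used / 2**20:.0f} MiB / {info.total / 2**20:.0f} MiB")
    time.sleep(1)
```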
- Tested all of them (with `small`) without `CT2_*` env and got `ValueError: Requested XXX compute type, but the target device or backend do not support efficient XXX computation.` except for `float32`, which triggers a segfault (a sketch for querying the supported types follows this list).
- `float32` always segfaults
- Setting `CT2_CUDA_ALLOW_FP16=1`, it only works for `float16` (others trigger `ValueError`)
- Setting `CT2_CUDA_ALLOW_BF16=1`, then `bfloat16` gives `RuntimeError: cuDNN failed with status CUDNN_STATUS_ARCH_MISMATCH` (others trigger `ValueError`)
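For completeness, the compute types CTranslate2 considers efficiently supported can be queried directly; a minimal sketch, assuming the stock `ctranslate2` Python package and that any `CT2_CUDA_ALLOW_*` variables are set before the query:

```python
import ctranslate2

# Lists the compute types reported as efficiently supported on the CUDA device.
# On a card without FP16 tensor cores this typically excludes float16/bfloat16,
# which would match the ValueError messages above.
print(ctranslate2.get_supported_compute_types("cuda"))
```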
`auto` and `default` select `float32`:
[2024-05-27 08:57:18.106] [ctranslate2] [thread 3417167] [info] - Allow INT8: false
[2024-05-27 08:57:18.106] [ctranslate2] [thread 3417167] [info] - Allow FP16: false (with Tensor Cores: false)
[2024-05-27 08:57:18.106] [ctranslate2] [thread 3417167] [info] - Allow BF16: false
[2024-05-27 08:57:19.253] [ctranslate2] [thread 3417167] [info] Using CUDA allocator: cub_caching
[2024-05-27 08:57:19.995] [ctranslate2] [thread 3417167] [info] - Binary version: 6
[2024-05-27 08:57:19.995] [ctranslate2] [thread 3417167] [info] - Model specification revision: 3
[2024-05-27 08:57:19.995] [ctranslate2] [thread 3417167] [info] - Selected compute type: float32
`medium` segfaults even with `CT2_CUDA_ALLOW_FP16=1`
Try int8 or float16 quantization. Your GPU is too small to work with the medium model in float32; that's normal. `bfloat16` only works with GPUs of compute capability 8.x or newer (your GPU could be only 7.x).
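A minimal sketch of that suggestion through faster-whisper (model name and audio path are placeholders; note the log above reports "Allow INT8: false" on this card, so int8 may still be rejected there and float16 with `CT2_CUDA_ALLOW_FP16=1` is the likelier option):

```python
from faster_whisper import WhisperModel

# Quantized weights: int8 is roughly 4x smaller and float16 roughly 2x smaller
# than float32, which is what makes the larger models plausible on a 2 GB GPU.
model = WhisperModel("small", device="cuda", compute_type="int8")
segments, _ = model.transcribe("foo.mp4", language="en")
print(" ".join(segment.text for segment in segments))
```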