CUDA runtime error: invalid device function sampling_topp_kernels.cu
SaigyoujiYuyuko233 opened this issue
Hi guys,
I'm using:
- Model: codegen-2B-multi
- GPU: GTX 1070 w/ 8G VRAM
- Sys: Fedora 36
5.18.16-200.fc36.x86_64
- NV Driver 515.57 w/ CUDA 11.7
Using podman as container runtime with NV container toolkit
Client: Podman Engine
Version: 4.1.1
API Version: 4.1.1
Go Version: go1.18.4
Built: Fri Jul 22 15:05:59 2022
OS/Arch: linux/amd64
cat /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ]
  },
  "when": {
    "always": true,
    "commands": [".*"]
  },
  "stages": ["prestart"]
}
The command nvidia-smi works fine inside the container.
Problem
The Triton server starts fine, but it crashes when I send a request using the OpenAI API demo from the README.
Is this a GPU compatibility issue? If so, which GPU models are supported?
Any help will be appreciated!
Hmm, FasterTransformer has only been tested on Compute Capability >= 7.0, and the 1070 is 6.0. So it's possible something it uses is limited to more recent cards. For now I'll add a note to the README but I'll leave this open to investigate further.
The line that's failing is:
check_cuda_error(
cub::DeviceSegmentedRadixSort::SortPairsDescending(nullptr,
cub_temp_storage_size,
log_probs,
(T*)nullptr,
id_vals,
(int*)nullptr,
vocab_size * batch_size,
batch_size,
begin_offset_buf,
offset_buf + 1,
0, // begin_bit
sizeof(T) * 8, // end_bit = sizeof(KeyT) * 8
stream)); // cudaStream_t
Thanks for the reply! I will get a better card later.
After testing, this now works fine on a 1060 (Compute Capability 6.1).
BTW, the Compute Capability of the 1070 should also be 6.1, not 6.0.
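For anyone still hitting this on a Pascal card: if the prebuilt library was compiled without your architecture, rebuilding FasterTransformer for it may help. A hedged sketch, assuming FasterTransformer's documented CMake-based build (the -DSM value targets Compute Capability 6.1; repository URL and directory layout are assumptions, so check the project's own build instructions):

```shell
# Rebuild FasterTransformer targeting Pascal (Compute Capability 6.1).
git clone https://github.com/NVIDIA/FasterTransformer.git
cd FasterTransformer && mkdir -p build && cd build
cmake -DSM=61 -DCMAKE_BUILD_TYPE=Release ..
make -j"$(nproc)"
```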
Closing this as @Frederisk finds it working. If still an issue, please reopen.