fauxpilot / fauxpilot

FauxPilot - an open-source alternative to GitHub Copilot server

CUDA runtime error: invalid device function sampling_topp_kernels.cu

SaigyoujiYuyuko233 opened this issue · comments

Hi guys,

I'm using:

  • Model: codegen-2B-multi
  • GPU: GTX 1070 with 8 GB VRAM
  • OS: Fedora 36, kernel 5.18.16-200.fc36.x86_64
  • NVIDIA driver 515.57 with CUDA 11.7

Using podman as the container runtime, with the NVIDIA Container Toolkit:

Client:       Podman Engine
Version:      4.1.1
API Version:  4.1.1
Go Version:   go1.18.4
Built:        Fri Jul 22 15:05:59 2022
OS/Arch:      linux/amd64
cat  /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/bin/nvidia-container-toolkit",
        "args": ["nvidia-container-toolkit", "prestart"],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ]
    },
    "when": {
        "always": true,
        "commands": [".*"]
    },
    "stages": ["prestart"]
}

The nvidia-smi command works fine inside the container.

[screenshot: nvidia-smi output inside the container]

Problem

The Triton server starts fine, but it crashes when I send a request using the OpenAI API demo from the README.
[screenshots: Triton server log and the CUDA runtime error from sampling_topp_kernels.cu]
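For context, the README demo drives the server through an OpenAI-style completions request. A minimal sketch of the kind of payload involved (the host, port, and model alias below are assumptions from a default FauxPilot setup, not taken from this thread):

```python
# Sketch of the completion request an OpenAI-API-style demo sends.
# localhost:5000 and the "codegen" model alias are ASSUMPTIONS from
# a default FauxPilot setup; adjust for your deployment.
import json

def build_completion_request(prompt, max_tokens=16, temperature=0.1):
    """Build an OpenAI-style completions payload."""
    return {
        "model": "codegen",        # assumed model alias served by Triton
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_completion_request("def hello_world():")
print(json.dumps(payload))
# To actually send it (requires a running server), e.g.:
#   requests.post("http://localhost:5000/v1/completions", json=payload)
```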

Is this a GPU compatibility issue? If so, which GPU models are supported?

Any help will be appreciated!

Hmm, FasterTransformer has only been tested on Compute Capability >= 7.0, and the 1070 is 6.0. So it's possible something it uses is limited to more recent cards. For now I'll add a note to the README but I'll leave this open to investigate further.

The line that's failing is:

            check_cuda_error(
                cub::DeviceSegmentedRadixSort::SortPairsDescending(nullptr,
                                                                   cub_temp_storage_size,
                                                                   log_probs,
                                                                   (T*)nullptr,
                                                                   id_vals,
                                                                   (int*)nullptr,
                                                                   vocab_size * batch_size,
                                                                   batch_size,
                                                                   begin_offset_buf,
                                                                   offset_buf + 1,
                                                                   0,              // begin_bit
                                                                   sizeof(T) * 8,  // end_bit = sizeof(KeyT) * 8
                                                                   stream));       // cudaStream_t
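Since the suspicion is a compute-capability mismatch, a simple gate like the following (a hypothetical Python sketch, not FasterTransformer's actual check) illustrates why a Pascal-era card could be rejected while Volta and newer pass:

```python
# Hypothetical sketch: gate a code path on CUDA compute capability.
# FasterTransformer is reported tested only on CC >= 7.0 (Volta+);
# Pascal cards such as the GTX 1070 report 6.1.

MIN_CC = (7, 0)  # (major, minor) minimum, per the comment above

def meets_minimum(major, minor, minimum=MIN_CC):
    """True if a device's compute capability is at or above `minimum`.

    Tuple comparison orders by major first, then minor, matching how
    compute capabilities are compared.
    """
    return (major, minor) >= minimum

print(meets_minimum(6, 1))  # GTX 1070 (Pascal)
print(meets_minimum(7, 0))  # V100 (Volta)
print(meets_minimum(8, 6))  # RTX 3090 (Ampere)
```

On real hardware the major/minor pair would come from `cudaGetDeviceProperties`; note the thread later reports the code in fact working on a 6.1 card, so the minimum here reflects what was tested, not a hard requirement.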


Thanks for the reply! I will get a better card later.

After my testing, this now works fine on a 1060 (Compute Capability 6.1).

By the way, the Compute Capability of the 1070 should also be 6.1, not 6.0.

Closing this as @Frederisk finds it working. If still an issue, please reopen.