triton-inference-server / fastertransformer_backend


NCCL 'unhandled cuda error'

SeibertronSS opened this issue

Branch: main
Triton Docker Version: 22.03-py3
GPU: V100
CUDA version: 11.2
Driver Version: 460.32.03

I use Triton+FT to deploy the GPT-NeoX 20B model. When I set the model output length to a value greater than 180, the following error occurs:
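
For reference, here is a minimal sketch of how the failing request can be issued. The model name ("fastertransformer") and the tensor names (input_ids, input_lengths, request_output_len) follow the GPT examples in this repo; my actual deployment goes through the ensemble with pre/postprocessing, so the exact names are an assumption, and the token IDs are placeholders:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Any value greater than 180 triggers the NCCL error below.
OUTPUT_LEN = 200

client = grpcclient.InferenceServerClient("localhost:8001")

# Placeholder token IDs; in the real deployment these come from the
# preprocessing model in the ensemble.
input_ids = np.array([[9915, 27221, 59, 77, 383]], dtype=np.uint32)
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
request_output_len = np.array([[OUTPUT_LEN]], dtype=np.uint32)

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)]:
    tensor = grpcclient.InferInput(name, list(data.shape), "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

result = client.infer("fastertransformer", inputs)
print(result.as_numpy("sequence_length"))
```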

I0526 10:11:21.620680 101 libfastertransformer.cc:1231] streaming response is sent
I0526 10:11:21.620698 101 python.cc:616] model postprocessing, instance postprocessing_0, executing 1 requests
I0526 10:11:21.621395 101 infer_response.cc:166] add response output: output: OUTPUT, type: BYTES, shape: [1]
I0526 10:11:21.621417 101 pinned_memory_manager.cc:161] pinned memory allocation: size 322, addr 0x7eff9a000180
I0526 10:11:21.621423 101 ensemble_scheduler.cc:523] Internal response allocation: OUTPUT, size 322, addr 0x7eff9a000180, memory type 1, type id 0
I0526 10:11:21.621441 101 ensemble_scheduler.cc:538] Internal response release: size 322, addr 0x7eff9a000180
I0526 10:11:21.621456 101 infer_response.cc:140] add response output: output: output_log_probs, type: FP32, shape: [1,1,200]
I0526 10:11:21.621485 101 grpc_server.cc:2544] GRPC: unable to provide 'output_log_probs' in GPU, will use CPU
I0526 10:11:21.621499 101 grpc_server.cc:2555] GRPC: using buffer for 'output_log_probs', size: 800, addr: 0x7efef406c8a0
I0526 10:11:21.801547 101 infer_response.cc:140] add response output: output: cum_log_probs, type: FP32, shape: [1,1]
I0526 10:11:21.801605 101 grpc_server.cc:2544] GRPC: unable to provide 'cum_log_probs' in GPU, will use CPU
I0526 10:11:21.801622 101 grpc_server.cc:2555] GRPC: using buffer for 'cum_log_probs', size: 4, addr: 0x7efef406ccf0
I0526 10:11:21.801686 101 infer_response.cc:140] add response output: output: sequence_length, type: UINT32, shape: [1,1]
I0526 10:11:21.801704 101 grpc_server.cc:2544] GRPC: unable to provide 'sequence_length' in GPU, will use CPU
I0526 10:11:21.801710 101 grpc_server.cc:2555] GRPC: using buffer for 'sequence_length', size: 4, addr: 0x7efef406ce60
I0526 10:11:21.801738 101 infer_response.cc:140] add response output: output: OUTPUT_0, type: BYTES, shape: [1]
I0526 10:11:21.801780 101 grpc_server.cc:2544] GRPC: unable to provide 'OUTPUT_0' in CPU_PINNED, will use CPU
I0526 10:11:21.801793 101 grpc_server.cc:2555] GRPC: using buffer for 'OUTPUT_0', size: 322, addr: 0x7efef406d0c0
I0526 10:11:21.801812 101 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7eff9a000180
I0526 10:11:21.801844 101 grpc_server.cc:4188] ModelStreamInferHandler::StreamInferComplete, context 3, 4 step ISSUED, callback index 137, flags 0
I0526 10:11:21.801873 101 grpc_server.cc:2637] GRPC free: size 800, addr 0x7efef406c8a0
I0526 10:11:21.801882 101 grpc_server.cc:2637] GRPC free: size 4, addr 0x7efef406ccf0
I0526 10:11:21.801894 101 grpc_server.cc:2637] GRPC free: size 4, addr 0x7efef406ce60
I0526 10:11:21.801899 101 grpc_server.cc:2637] GRPC free: size 322, addr 0x7efef406d0c0
I0526 10:11:21.801944 101 python.cc:1960] TRITONBACKEND_ModelInstanceExecute: model instance name postprocessing_0 released 1 requests
I0526 10:11:21.802584 101 grpc_server.cc:3825] Process for ModelStreamInferHandler, rpc_ok=1, context 3, 4 step WRITEREADY
I0526 10:11:21.802696 101 grpc_server.cc:3825] Process for ModelStreamInferHandler, rpc_ok=1, context 3, 4 step WRITTEN
Failed, NCCL error /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'
Failed, NCCL error /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'
Failed, NCCL error /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'
corrupted double-linked list
Signal (11) received.
I0526 10:11:21.881235 101 libfastertransformer.cc:1266] Stop to forward
Signal (6) received.

The following is the output of the dmesg command after the error occurs:
dmesg.txt

The graphics cards remain at this high utilization after the error (a programmatic check follows the nvidia-smi output below):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:0B.0 Off |                    0 |
| N/A   37C    P0    44W / 250W |  12085MiB / 16160MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   37C    P0    44W / 250W |  12059MiB / 16160MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:00:0D.0 Off |                    0 |
| N/A   37C    P0    43W / 250W |  12059MiB / 16160MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:00:0E.0 Off |                    0 |
| N/A   39C    P0    45W / 250W |  12059MiB / 16160MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    282818      C   tritonserver                    12053MiB |
|    1   N/A  N/A    282818      C   tritonserver                    12053MiB |
|    2   N/A  N/A    282818      C   tritonserver                    12053MiB |
|    3   N/A  N/A    282818      C   tritonserver                    12053MiB |
+-----------------------------------------------------------------------------+
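
For completeness, here is a minimal sketch for confirming the stuck utilization programmatically. It assumes the pynvml bindings (pip install nvidia-ml-py), which are not part of this repo; it is just a stand-in for watching nvidia-smi:

```python
import pynvml

# Poll utilization and memory for every GPU; after the crash all four
# V100s stay at 100% utilization with ~12 GiB still allocated.
pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: util={util.gpu}%  mem={mem.used / 2**20:.0f} MiB")
finally:
    pynvml.nvmlShutdown()
```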