triton-inference-server / fastertransformer_backend


[FT][ERROR] CUDA runtime error: out of memory /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/allocator.h:220

bigmover opened this issue · comments


Description

After starting the FasterTransformer backend server, I checked its health with curl:
$curl -v localhost:4008/v2/healthy/ready
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 4008 (#0)
> GET /v2/healthy/ready HTTP/1.1
> Host: localhost:4008
> User-Agent: curl/7.58.0
> Accept: */*
>
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
* Failed writing body (0 != 46)
* stopped the pause stream!
* Closing connection 0

is_server_live(), is_model_ready(), and is_server_ready() all return True (a client sketch is included after the stack trace below), but as soon as I post an inference request the server throws a runtime error:
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: out of memory /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/allocator.h:220

Signal (6) received.
 0# 0x0000557F92033459 in /opt/tritonserver/bin/tritonserver
 1# 0x00007F3D6BB3C090 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F3D6BEF5911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F3D6BF0138C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F3D6BF013F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F3D6BF016A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 9# fastertransformer::Allocator<(fastertransformer::AllocatorType)0>::malloc(unsigned long, bool, bool) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
10# void* fastertransformer::IAllocator::reMalloc<int>(int*, unsigned long, bool, bool) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
11# LlamaTritonModelInstance<__half>::allocateBuffer(unsigned long, unsigned long, unsigned long, unsigned long) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
12# LlamaTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
13# 0x00007F3CA169FC88 in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
14# 0x00007F3CA16B809A in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
15# 0x00007F3D6BF2DDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
16# 0x00007F3D6D2A5609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
17# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Aborted (core dumped)
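For reference, the readiness checks above were done with the Triton Python client, roughly as in the sketch below. This is only a sketch: the gRPC URL and the model name "fastertransformer" are assumptions, so adjust them to your deployment.

# Minimal readiness probe with the Triton Python gRPC client.
# Assumptions: the server listens for gRPC on port 4008 and the model is named "fastertransformer".
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:4008")

print("is_server_live: ", client.is_server_live())    # the server process responds
print("is_server_ready:", client.is_server_ready())   # all models finished loading
print("is_model_ready: ", client.is_model_ready("fastertransformer"))  # this model is ready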

Reproduced Steps

Only one GPU is used. The server is started with the command:
$CUDA_VISIBLE_DEVICES=1 /opt/tritonserver/bin/tritonserver  --model-repository=./triton_model_store/llama/ --grpc-port 4008
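A minimal inference request along the lines of the sketch below is enough to trigger the error; the OOM is raised on the server inside LlamaTritonModelInstance::forward / allocateBuffer while the per-request GPU buffers are allocated. The tensor names, dtypes, and the model name "fastertransformer" are assumptions taken from the GPT example of this backend; check your config.pbtxt for the exact I/O definitions.

# Sketch of a small request against the FasterTransformer backend via gRPC.
# Assumptions: model named "fastertransformer"; inputs input_ids / input_lengths /
# request_output_len as in the backend's GPT example (verify against config.pbtxt).
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:4008")

input_ids = np.array([[1, 2, 3, 4]], dtype=np.uint32)              # one short prompt
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)  # its length
request_output_len = np.array([[16]], dtype=np.uint32)             # ask for 16 new tokens

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)]:
    tensor = grpcclient.InferInput(name, list(data.shape), "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

# Even a request this small fails with the CUDA OOM above if the model weights
# already fill the GPU, because forward() still has to allocate activation buffers.
result = client.infer("fastertransformer", inputs)
print(result.get_response())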
commented

How did you solve it?