ggerganov / llama.cpp

LLM inference in C/C++

RPC issues and comments

steampunque opened this issue · comments

Local LAN, 1x 1070 + 1x 4070 + 1x 4070, configured with the new RPC backend and a patched server to use RPC.

I did a run fully offloading Mixtral Q4_K_M onto the 3 GPUs with RPC, and it all looked good:

llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: RPC buffer size = 7043.34 MiB (1070)
llm_load_tensors: RPC buffer size = 9391.12 MiB (4070)
llm_load_tensors: RPC buffer size = 8711.09 MiB (4070)

All layers offloaded and the timings I am getting are:

pp 105.99 tokens per second
tg 25.68 tokens per second

This compares to around 5 t/s generation with CPU + a single 4070, so the 5x+ speedup is nice. It seems to be working OK. Some issues I have found so far:

The rpc servers are spamming rpc_get_tensor and rpc_set_tensor messages to the console; this should be turned off unless debugging.

I initially tried a partial offload to two machines (8G + 12G) but got an out-of-memory crash on one of the servers, so I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. it is GPU offload only, so if your model doesn't fit in GPU memory there is no way to pick up the rest of the layers with the CPU? More of a question. It should be possible to pick up the remaining layers that won't fit into the RPC GPUs on the host running the server (my host has 128G RAM).

When the rpc servers crash, they cannot be restarted without a hard restart of the RPC subsystem (restarting rpcbind, etc.). Something is not being cleaned up correctly when the rpc servers crash with a SEGV.

Also, great work on this feature, it is extremely useful! It will be very good to support mixed CPU and GPU with this mode though, so that the crazy models such as DBRX, Command R+, Falcon 180B, and the 70B Llama 3 monster could be run if desired.

I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. it is GPU offload only, so if your model doesn't fit in GPU memory there is no way to pick up the rest of the layers with the CPU?

Might be possible to start 2 RPC servers on the same machine - one CPU and one GPU?

Thanks for the feedback!

The rpc servers are spamming rpc_get_tensor and rpc_set_tensor messages to the console; this should be turned off unless debugging.

Sure, will turn this off by default.

I am guessing that RPC mode currently does not support mixed CPU and GPU offload

The problem is that we don't report available memory on CPU and Metal and it defaults to 1MB:
https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/rpc-server.cpp#L41
As a stopgap solution I will add command line arguments for free_mem and total_mem until we implement this for all platforms.
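
As a rough sketch of what that stopgap could look like (the --free-mem/--total-mem flag names below are made up for illustration; the actual rpc-server options may differ):

// Sketch of a stopgap: let the user override the memory the RPC server
// reports to the client, instead of the hardcoded 1 MB default.
// The --free-mem/--total-mem flag names are hypothetical.
#include <cstdio>
#include <cstring>
#include <cstdlib>

int main(int argc, char * argv[]) {
    size_t free_mem  = 1 * 1024 * 1024; // current 1 MB fallback
    size_t total_mem = 1 * 1024 * 1024;
    for (int i = 1; i < argc - 1; i++) {
        if (strcmp(argv[i], "--free-mem") == 0) {
            free_mem = strtoull(argv[i + 1], nullptr, 10) * 1024 * 1024; // value in MB
        } else if (strcmp(argv[i], "--total-mem") == 0) {
            total_mem = strtoull(argv[i + 1], nullptr, 10) * 1024 * 1024;
        }
    }
    printf("reporting free_mem=%zu total_mem=%zu\n", free_mem, total_mem);
    // these values would then be handed to the RPC server startup code
    return 0;
}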

It should be possible to pick up the remaining layers that won't fit into the RPC GPUs on the host running the server (my host has 128G RAM).

Agree, will be working on this.

When the rpc servers crash, they cannot be restarted without a hard restart of the RPC subsystem (restarting rpcbind, etc.). Something is not being cleaned up correctly when the rpc servers crash with a SEGV.

I guess this is because rpc-server fails to start on the same port and it needs some time for the old socket to expire. Does changing the port help?

I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. it is GPU offload only, so if your model doesn't fit in GPU memory there is no way to pick up the rest of the layers with the CPU?

Might be possible to start 2 RPC servers on the same machine - one CPU and one GPU?

Good idea, I did not think of that. For a big model like Mixtral 8x22B, loading would be very slow though, pushing 50-60 GB through a local RPC socket instead of moving the layers straight from file into memory... and it would also be slower on inference, though I'm guessing that would not be a huge overhead hit since the CPU is not fast anyway.

I am guessing that RPC mode currently does not support mixed CPU and GPU offload

The problem is that we don't report available memory on CPU and Metal and it defaults to 1MB: https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/rpc-server.cpp#L41 As a stopgap solution I will add command line arguments for free_mem and total_mem until we implement this for all platforms.

That would be a perfectly fine stopgap solution, thanks!

When the rpc servers crash, they cannot be restarted without a hard restart of the RPC subsystem (restarting rpcbind, etc.). Something is not being cleaned up correctly when the rpc servers crash with a SEGV.

I guess this is because rpc-server fails to start on the same port and it needs some time for the old socket to expire. Does changing the port help?

Most likely it would work, but then the host (serverrpc in my case) would also need to be restarted with the new port address for the machine that crashed. To handle this cleanly, we most likely need to install a signal handler trapping ideally any signal that can abort the rpc server (SIGHUP, SIGTERM) as well as the errors (SIGSEGV), and clean up the port before exit. I think SEGVs are extremely tricky to handle this way though (https://stackoverflow.com/questions/2663456/how-to-write-a-signal-handler-to-catch-sigsegv), so it might be better to find the source of the SEGV when it ran out of memory and just go to a soft reset state when whatever would have caused the SEGV (running out of GPU memory did it here) happens. When I stopped the program with Ctrl-C on the command line it was able to restart on the same port no problem, but I didn't try stopping it using kill.
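
For the clean-shutdown part, a minimal sketch of the idea (the global g_listen_fd is an assumption standing in for wherever the server keeps its socket; SIGSEGV is deliberately left out since, per the link above, post-segfault cleanup is unreliable):

// Sketch: release the listening port on SIGINT/SIGHUP/SIGTERM so a
// restarted rpc-server can bind immediately. Catching SIGSEGV is
// omitted on purpose: after a segfault the process state is undefined.
#include <signal.h>
#include <unistd.h>

static int g_listen_fd = -1; // assumed to be set once the server socket exists

static void handle_term(int /*sig*/) {
    if (g_listen_fd >= 0) {
        close(g_listen_fd); // close the socket so the port is freed
    }
    _exit(0); // async-signal-safe exit
}

static void install_signal_handlers() {
    struct sigaction sa = {};
    sa.sa_handler = handle_term;
    sigaction(SIGINT,  &sa, nullptr);
    sigaction(SIGHUP,  &sa, nullptr);
    sigaction(SIGTERM, &sa, nullptr);
}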

I think that setting SO_REUSEADDR on the socket might help here. Will give it a try and submit a PR if it works.
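
For reference, that boils down to a single setsockopt() call before bind(); a minimal sketch of a server socket created this way (not the actual change that went into the PR):

// Sketch: SO_REUSEADDR lets a restarted server bind to a port whose
// previous socket is still in TIME_WAIT after a crash.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

int create_server_socket(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    sockaddr_in addr = {};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);
    if (bind(fd, (sockaddr *)&addr, sizeof(addr)) < 0 || listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}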

When the rpc servers crash, they cannot be restarted without a hard restart of the RPC subsystem (restarting rpcbind, etc.).

I believe I fixed this with #7320 . Let me know if it works for you.

When the rpc servers crash, they cannot be restarted without a hard restart of the RPC subsystem (restarting rpcbind, etc.).

I believe I fixed this with #7320 . Let me know if it works for you.

NO PATCH:

bash-5.1$ ll_startrpc
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7918 MB
Accepted client connection, free_mem=8303280128, total_mem=8500477952
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6144.00 MiB on device 0: cudaMalloc failed: out of memory
/usr/local/bin/ll_startrpc: line 1: 27019 Segmentation fault rpc-server 0.0.0.0 50052
bash-5.1$
bash-5.1$
bash-5.1$ rpc-server 0.0.0.0 50052
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7918 MB
Failed to create server socket

WITH PATCH:

bash-5.1$ ll_startrpc
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7822 MB
Accepted client connection, free_mem=8202092544, total_mem=8500477952
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6144.00 MiB on device 0: cudaMalloc failed: out of memory
/usr/local/bin/ll_startrpc: line 1: 28747 Segmentation fault rpc-server -H 0.0.0.0 -p 50052
bash-5.1$
bash-5.1$
bash-5.1$
bash-5.1$ ll_startrpc
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7822 MB

I applied the patch to b2901 and it looks to fix the problem; I never had to reset the RPC subsystem after the first crash.

I'd argue the SEGV is a bug which should be fixed. It should be debuggable by running a debug build of rpc-server under gdb and seeing what code is doing the illegal memory access when it runs out of memory. You can force the crash just by allocating a huge context (here I use 16384 with phi-3) into an 8G GPU; it should reliably run out of memory and then SEGV:

FATTN=0 RPC=1 C4k=16384 ll_start phi-3 1
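
The kind of fix being suggested amounts to a null check on the allocation result: ggml_backend_buft_alloc_buffer() returns nullptr when the backend allocation (e.g. cudaMalloc) fails, so a guarded path might look like this sketch (the wrapper function and its error handling are hypothetical, not the actual rpc-server code):

// Sketch: treat an out-of-memory allocation as a soft error instead of
// letting a null buffer be dereferenced (the presumed source of the SEGV).
#include <cstdio>
#include "ggml-backend.h"

static bool rpc_alloc_buffer_checked(ggml_backend_buffer_type_t buft,
                                     size_t size,
                                     ggml_backend_buffer_t & out) {
    out = ggml_backend_buft_alloc_buffer(buft, size);
    if (out == nullptr) {
        // e.g. cudaMalloc failed: report the failure so the server can
        // send an error response to the client and keep running
        fprintf(stderr, "alloc_buffer: failed to allocate %zu bytes\n", size);
        return false;
    }
    return true;
}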

Thanks for fixing this problem!

On an RPC-related note, I was testing Mixtral fully offloaded to 3 GPUs on 3 separate machines yesterday, and on one of my test prompts it just stopped generating shortly after beginning the answer. Testing with CPU + 1 GPU local offload, no RPC, the same prompt worked fine. I don't know if this is an RPC issue or a multi-GPU issue related to splitting the KV cache across the 3 GPUs. If I can find out anything more about it, I will open a separate issue for this newly discovered problem.

b2901 seems to be working fine: phi-3 scored identical lambada and nearly identical prompt and generation speeds with both RPC and local CPU/GPU modes. Thank you also for getting rid of all that messaging; it runs much cleaner now.

Q9650 CPU + 1070 GPU

GPU RPC (fully offloaded)
pp 566.47 tokens per second
tg 41.30 tokens per second

GPU local (fully offloaded)
pp 601.75 tokens per second
tg 49.17 tokens per second