ggerganov / llama.cpp

LLM inference in C/C++

enable rpc for server

steampunque opened this issue

I made a quick patch to the server to test RPC, running phi-3 fully offloaded onto a remote GPU, and everything seemed OK. Timings:

pp: 258.19 tokens per second
tg: 48.41 tokens per second

Running locally on the same GPU (the one in the remote machine) gives:

pp: 563.30 tokens per second
tg: 92.00 tokens per second

Possible Implementation


Patches are trivial:

```diff
     printf("  --port PORT               port to listen (default: %d)\n", sparams.port);
+    printf("  --rpc SERVERS             comma separated list of RPC servers\n");
```

```diff
     } else if (arg == "--host") {
         if (++i >= argc) {
             invalid_param = true;
             break;
         }
         sparams.hostname = argv[i];
+    } else if (arg == "--rpc") {
+        if (++i >= argc) {
+            invalid_param = true;
+            break;
+        }
+        params.rpc_servers = argv[i];
```