antimatter15 / alpaca.cpp

Locally run an Instruction-Tuned Chat-Style LLM

Painfully slow

jacobhweeks opened this issue · comments

I managed to get alpaca running in a Hyper-V VM on my PowerEdge R710. The VM has 8 cores and 16 GB of RAM, running Ubuntu 22.04. I had to build chat from source with make; otherwise I got an Illegal Instruction error. The problem is that it takes about a minute per token. How can I improve this?

main: seed = 1680482599
llama_model_load: loading model from 'ggml-alpaca-7b-q4.bin' - please wait ...
llama_model_load: ggml ctx size = 6065.34 MB
llama_model_load: memory_size = 2048.00 MB, n_mem = 65536
llama_model_load: loading model part 1/1 from 'ggml-alpaca-7b-q4.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
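
A side note on the log above: system_info shows n_threads = 4 / 8, i.e. only half the VM's cores in use. In the llama.cpp-era code this fork derives from, the default thread count is capped at 4 and can be raised with the -t flag; a minimal sketch of that default, assuming the llama.cpp-era behavior rather than anything verified against this exact fork:

#include <algorithm>
#include <thread>

// Sketch of the assumed default: cap at 4 threads no matter how many
// hardware threads the machine reports, which matches "n_threads = 4 / 8".
int default_thread_count() {
    return (int) std::min(4u, std::thread::hardware_concurrency());
}

So on this 8-core VM, running chat with -t 8 should roughly double the compute applied per token, although memory bandwidth is usually the real bottleneck for a 4 GB quantized model.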

I would like to rebuild from source, enabling each of the options such as AVX and AVX2 one at a time, so that I can tune it for my machine, but I don't know where to do this in chat.cpp.
I also don't know whether this is a good solution; it's just an idea. Were these features disabled when I compiled from source, possibly because my machine doesn't support them?
Can anyone please help?
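
To answer the "does my machine support those features?" part of the question above directly: you can check what the CPU (as exposed to the VM) supports without touching the build at all. A small standalone check using the GCC/Clang builtin __builtin_cpu_supports; this is an illustrative sketch, not part of alpaca.cpp:

#include <cstdio>

int main() {
    __builtin_cpu_init(); // initialize the runtime CPU-feature detection
    // Each query returns nonzero if the CPU running this binary has the feature.
    std::printf("sse3: %d\n", __builtin_cpu_supports("sse3"));
    std::printf("avx:  %d\n", __builtin_cpu_supports("avx"));
    std::printf("avx2: %d\n", __builtin_cpu_supports("avx2"));
    std::printf("fma:  %d\n", __builtin_cpu_supports("fma"));
}

An R710 runs Nehalem/Westmere-era Xeons, which predate AVX entirely, so everything past sse3 will likely print 0. That is consistent with the system_info line above, and likely why a prebuilt binary compiled with AVX died with Illegal Instruction.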

From what I could see, the settings are determined automatically. I believe they're set in a C source file named ggml.c. You could make a backup of that file, then edit it manually and see what suits your machine. Just don't go overboard, or your machine might freeze and you'll have to reboot it.
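
For later readers: the SIMD choices in ggml.c are made by the C preprocessor at compile time, not computed at runtime, so the system_info flags reflect the compiler options the binary was built with (e.g. -mavx2, or whatever -march=native implies). The pattern looks roughly like this; a simplified sketch, not the exact source:

// Simplified illustration of ggml.c's compile-time SIMD dispatch. The
// compiler defines __AVX2__ / __AVX__ / __SSE3__ when given the matching
// -mavx2 / -mavx / -msse3 flags, and the preprocessor picks one path.
const char * simd_path(void) {
#if defined(__AVX2__)
    return "avx2";   // widest vector kernels
#elif defined(__AVX__)
    return "avx";
#elif defined(__SSE3__)
    return "sse3";   // the only path enabled in the log above
#else
    return "scalar"; // portable fallback
#endif
}

In other words, the place to change these settings is the CFLAGS passed to the compiler, not ggml.c itself, and enabling an instruction set the CPU lacks will just bring back the Illegal Instruction crash.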

@Tempaccnt thank you. I figured out that they're set automatically based on the CPU when I recognized those as CPU feature flags last night. That's a bummer. Will I gain any benefit from installing a GPU, if I can pass it through to the VM, or does the software only use the CPU?

As far as I have read, this approach with alpaca is CPU-only. I'm trying to figure things out myself, since I'm running an old i7-6700K and tokens are crawling while the CPU is pegged right now.

I am seeing similarly slow speeds on an Intel Mac. If anyone has any suggestions, I would be most appreciative. Thanks.