antimatter15 / alpaca.cpp

Locally run an Instruction-Tuned Chat-Style LLM

Running very slow

reddiamond1234 opened this issue · comments

Hi, my alpaca is running very slowly and I don't know why. I am running it on an Ubuntu VM with these hardware specs: Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz, 64 GB RAM, no GPU. Any tips?

How many CPU cores are assigned to the virtual machine? The CPU has 12 cores and 24 threads; do you start the program with -t 12? And which model do you use?

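If it helps, you can check how many cores the VM actually exposes before picking a -t value, e.g. with:

nproc
lscpu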
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 42 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 22
On-line CPU(s) list: 0-21
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz
CPU family: 6
Model: 62
Thread(s) per core: 1
Core(s) per socket: 11
Socket(s): 2
Stepping: 4

I changed the code to run on 18 threads because it always defaults to 4, but even with 18 threads it is slow.
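As an aside, the thread count can also be set at run time with the -t flag instead of editing the code, e.g. (using the 7B model filename mentioned later in this thread):

./chat -m ggml-alpaca-7b-q4.bin -t 18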

My CPU has 16 threads but only 8 physical cores; with 8 threads it's faster on my system. Do you use the 7B, 13B or 30B model, and how much time per token do you get?
You can get it like this:

./chat -m ggml-alpaca-30b-q4.bin --color -t 8 --temp 0.8 -p "Write a text about Linux, 50 words long."
...
634.18 ms per token

Adjust the model filename/path and the thread count. On my system, text generation with the 30B model is not fast either.

I'm using the 7B version. Here is the same prompt you used (./chat -m ggml-alpaca-7b-q4.bin --color -t 8 --temp 0.8 -p "Write a text about Linux, 50 words long."):

main: mem per token = 14434244 bytes
main: load time = 4293.24 ms
main: sample time = 172.82 ms
main: predict time = 372913.59 ms / 2155.57 ms per token
main: total time = 386418.75 ms
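For scale, 372913.59 ms / 2155.57 ms per token works out to roughly 173 generated tokens, i.e. more than two seconds each, compared with the ~634 ms per token quoted above for the much larger 30B model.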

It is memory bound.
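In practice that means every generated token has to stream essentially all of the quantized weights from RAM, so memory bandwidth rather than core count sets the floor. As a rough, assumed illustration (not measured here): a q4_0 7B model is on the order of 4 GB, so at ~10 GB/s of effective bandwidth the best case is already around 400 ms per token, and extra threads beyond that point buy little.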

model.type  model.size  quantization  blas  context.size  eta   perplexity  efficiency
llama       7B          q4_0          1     2048          0.96  5.56        0.19
llama       7B          q4_0          1     1024          0.67  5.71        0.26
alpaca      7B          q4_0          1     2048          0.95  5.77        0.18
alpaca      7B          q4_0          1     1024          0.67  5.93        0.25
llama       7B          q4_0          1     512           0.53  6.46        0.29
alpaca      7B          q4_0          1     512           0.53  6.65        0.28

YMMV with alpacas.

@reddiamond1234, have you tried compiling from source? See #88 .

Went from 4868.26 ms per token to 890.21 ms per token for me when testing ./chat -m ggml-alpaca-30b-q4.bin --color -t 8 --temp 0.8 -p "Write a text about Linux, 50 words long." (7B model)
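If anyone needs it, a rough sketch of building from source on the target machine (see the repository README and #88 for the exact steps; building locally lets the compiler use whatever instruction sets the CPU exposes, which is presumably where the speedup comes from):

git clone https://github.com/antimatter15/alpaca.cpp
cd alpaca.cpp
make chat
./chat -m ggml-alpaca-7b-q4.bin -t 8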

The problem was with my CPU: there is no AVX available, and that makes it slow.
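For anyone else debugging this: you can check which AVX variants the guest actually exposes with something like:

grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u

If that prints nothing, AVX is not being passed through to the VM, which would explain the slowdown.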