Running very slow
reddiamond1234 opened this issue
Hi, my alpaca is running very slowly and I don't know why. I am running it on an Ubuntu VM with these hardware specs: Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz, 64 GB RAM, no GPU. Any tips?
How many CPU cores are assigned to the virtual machine? That CPU has 12 cores and 24 threads; do you start the program with -t 12? And which model do you use?
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 42 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 22
On-line CPU(s) list: 0-21
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz
CPU family: 6
Model: 62
Thread(s) per core: 1
Core(s) per socket: 11
Socket(s): 2
Stepping: 4
I changed the code to run on 18 threads, because it always defaults to 4. Even with 18 threads it is slow.
My CPU has 16 threads but only 8 physical cores; with 8 threads it's faster on my system. Do you use the 7B, 13B, or 30B model, and how much time per token?
You can get it like this:
./chat -m ggml-alpaca-30b-q4.bin --color -t 8 --temp 0.8 -p "Write a text about Linux, 50 words long."
...
634.18 ms per token
Adjust the model filename/path and the thread count. On my system, text generation with the 30B model is not fast either.
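If you want the thread count to follow the physical cores automatically instead of hard-coding it, something like this should work (a sketch: it assumes lscpu is available and uses the 7B model path, so adjust to your setup):

```sh
# one thread per physical core; hyperthreads usually don't help here
cores=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
./chat -m ggml-alpaca-7b-q4.bin --color -t "$cores" --temp 0.8 -p "Write a text about Linux, 50 words long."
```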
I'm using the 7B version. Here is the same prompt you had (./chat -m ggml-alpaca-7b-q4.bin --color -t 8 --temp 0.8 -p "Write a text about Linux, 50 words long."):
main: mem per token = 14434244 bytes
main: load time = 4293.24 ms
main: sample time = 172.82 ms
main: predict time = 372913.59 ms / 2155.57 ms per token
main: total time = 386418.75 ms
It is memory bound.
| model.type | model.size | quantization | blas | context.size | eta | perplexity | efficiency |
|---|---|---|---|---|---|---|---|
| llama | 7B | q4_0 | 1 | 2048 | 0.96 | 5.56 | 0.19 |
| llama | 7B | q4_0 | 1 | 1024 | 0.67 | 5.71 | 0.26 |
| alpaca | 7B | q4_0 | 1 | 2048 | 0.95 | 5.77 | 0.18 |
| alpaca | 7B | q4_0 | 1 | 1024 | 0.67 | 5.93 | 0.25 |
| llama | 7B | q4_0 | 1 | 512 | 0.53 | 6.46 | 0.29 |
| alpaca | 7B | q4_0 | 1 | 512 | 0.53 | 6.65 | 0.28 |
YMMV with alpacas.
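One way to put a number on the memory-bound claim: if each generated token has to stream the full 7B q4 weights from RAM once (a simplification, and the ~4 GB file size is an assumption here), the 2155.57 ms/token reported above implies an effective bandwidth of only about 1.9 GB/s:

```sh
# effective GB/s = model bytes / seconds per token
awk 'BEGIN { printf "%.1f GB/s\n", 4.0e9 / (2155.57 / 1000) / 1e9 }'
```

If that figure comes out near your platform's real memory bandwidth, more threads won't help; if it is far below it, as here, the compute side is worth a look too.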
@reddiamond1234, have you tried compiling from source? See #88.
Went from 4868.26 ms per token to 890.21 ms per token for me when testing ./chat -m ggml-alpaca-30b-q4.bin --color -t 8 --temp 0.8 -p "Write a text about Linux, 50 words long." (7B model)
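For anyone following along, the from-source build is short (this assumes the antimatter15/alpaca.cpp repo that ships the ./chat binary; see #88 for the discussion):

```sh
git clone https://github.com/antimatter15/alpaca.cpp
cd alpaca.cpp
make chat
./chat -m ggml-alpaca-7b-q4.bin --color -t 8
```

Building locally lets the compiler target your exact CPU's instruction set (e.g. via -march=native or explicit -mavx flags, depending on the Makefile), which is typically where speedups like the one above come from.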
The problem was with my CPU: there is no AVX support, and that makes it slow.
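For reference, you can check which AVX variants the guest OS actually sees; hypervisors often mask host CPU flags, so a VM can lack AVX even when the physical CPU supports it:

```sh
# prints avx / avx2 / avx512 flags if present; no output means no AVX
grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u
```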