antimatter15 / alpaca.cpp

Locally run an Instruction-Tuned Chat-Style LLM

Running very slow

reddiamond1234 opened this issue · comments

Hi, my alpaca is running very slowly and I don't know why. I am running it on an Ubuntu VM with these hardware specs: Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz, 64 GB RAM, no GPU. Any tips?

How many CPU cores are assigned to the virtual machine? The CPU has 12 cores and 24 threads; do you start the program with -t 12? And which model do you use?

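If it helps, you can check how many cores the VM actually exposes before picking a -t value, e.g. with:

nproc
lscpu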
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 42 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 22
On-line CPU(s) list: 0-21
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz
CPU family: 6
Model: 62
Thread(s) per core: 1
Core(s) per socket: 11
Socket(s): 2
Stepping: 4

I changed the code to run on 18 threads because it always defaults to 4, but even with 18 threads it is slow.
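As an aside, the thread count can also be set at run time with the -t flag instead of editing the code, e.g. (using the 7B model filename mentioned later in this thread):

./chat -m ggml-alpaca-7b-q4.bin -t 18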

My CPU has 16 threads but only 8 physical cores; with 8 threads it's faster on my system. Do you use the 7B, 13B or 30B model, and how much time per token do you get?
You can get it like this:

./chat -m ggml-alpaca-30b-q4.bin --color -t 8 --temp 0.8 -p "Write a text about Linux, 50 words long."
...
634.18 ms per token

Adjust the model filename/path and the thread count. On my system, text generation with the 30B model is not fast either.

I'm using the 7B version. Here is the same prompt you used (./chat -m ggml-alpaca-7b-q4.bin --color -t 8 --temp 0.8 -p "Write a text about Linux, 50 words long."):

main: mem per token = 14434244 bytes
main: load time = 4293.24 ms
main: sample time = 172.82 ms
main: predict time = 372913.59 ms / 2155.57 ms per token
main: total time = 386418.75 ms
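For scale, 372913.59 ms / 2155.57 ms per token works out to roughly 173 generated tokens, i.e. more than two seconds each, compared with the ~634 ms per token quoted above for the much larger 30B model.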

It is memory bound.
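In practice that means every generated token has to stream essentially all of the quantized weights from RAM, so memory bandwidth rather than core count sets the floor. As a rough, assumed illustration (not measured here): a q4_0 7B model is on the order of 4 GB, so at ~10 GB/s of effective bandwidth the best case is already around 400 ms per token, and extra threads beyond that point buy little.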

model.type  model.size  quantization  blas  context.size  eta   perplexity  efficiency
llama       7B          q4_0          1     2048          0.96  5.56        0.19
llama       7B          q4_0          1     1024          0.67  5.71        0.26
alpaca      7B          q4_0          1     2048          0.95  5.77        0.18
alpaca      7B          q4_0          1     1024          0.67  5.93        0.25
llama       7B          q4_0          1     512           0.53  6.46        0.29
alpaca      7B          q4_0          1     512           0.53  6.65        0.28

YMMV with alpacas.

@reddiamond1234, have you tried compiling from source? See #88 .

Went from 4868.26 ms per token to 890.21 ms per token for me when testing ./chat -m ggml-alpaca-30b-q4.bin --color -t 8 --temp 0.8 -p "Write a text about Linux, 50 words long." (7B model)
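If anyone needs it, a rough sketch of building from source on the target machine (see the repository README and #88 for the exact steps; building locally lets the compiler use whatever instruction sets the CPU exposes, which is presumably where the speedup comes from):

git clone https://github.com/antimatter15/alpaca.cpp
cd alpaca.cpp
make chat
./chat -m ggml-alpaca-7b-q4.bin -t 8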

The problem was with my CPU: there is no AVX available, and that makes it slow.
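For anyone else debugging this: you can check which AVX variants the guest actually exposes with something like:

grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u

If that prints nothing, AVX is not being passed through to the VM, which would explain the slowdown.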