Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.

Home Page: https://llamafile.ai

Non-AVX execution - SSE-only on Pentiums/Celerons?

ttsiodras opened this issue

On an N5095 Celeron machine, there are no AVX extensions. I believe the same applies to most Pentiums and Celerons...

$ grep ^flags /proc/cpuinfo  | head -1 | sed 's,^.*: ,,;s, ,\n,g'  | grep sse
sse
sse2
ssse3
sse4_1
sse4_2

Is it possible to run these models with SSE instructions?

I understand that AVX gives more speed because of wider registers, but is it just this, or some need for specific AVX functionality?

This was working on non-AVX CPUs prior to the recent update. I have the same issue, as I'm running on an older Xeon server. I would appreciate an updated branch for non-AVX if possible!

I can confirm the non-AVX issue with llamafile 0.7; the last working version for me was 0.6.2.

See cdd7458, where AVX became mandatory. llamafile 0.6.2 was the last version that supported SSSE3+. We currently lack the ability to runtime-dispatch anything but a few performance-critical routines (e.g. matmul), so the vast majority of the code runs at the baseline ISA, and if we set that baseline to SSE it makes things significantly slower on modern CPUs. Look at the benchmarks in my blog post https://justine.lol/matmul/ and notice how big the disparity was between llamafile 0.6.2 and llama.cpp. Our decision to mandate AVX played a big role in helping us catch up with llama.cpp and then surpass it.

Part of what changed with AVX is that it introduced a new method of encoding instructions (VEX encoding), and Intel decided to penalize code that uses the legacy SSE encodings. So in many respects, SSE-encoded code is radioactive to performance, and there's not much we can do about that.
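Given the AVX requirement in llamafile 0.7+, a quick way to check up front whether a machine can run a recent release is to look for the `avx` flag in /proc/cpuinfo, much like the grep shown earlier in this thread. This is an illustrative Linux-only check, not an official llamafile tool:

```shell
# Check whether this CPU advertises AVX before trying a recent llamafile.
if grep -qw avx /proc/cpuinfo; then
  echo "AVX available: llamafile 0.7+ should run"
else
  echo "No AVX: stick with llamafile 0.6.2 or earlier"
fi
```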

Thank you for the transparency! I do appreciate you keeping the older releases up as well :)