armbues / SiLLM

SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.

Slowness of sillm.chat on M2 Air with 16 GB RAM

kylewadegrove opened this issue

Any invocation of python -m sillm.chat model seems much slower on my machine than in the reference video: it takes more than a minute to get to the prompt, and the response generates at maybe 1-2 TPM (tokens per minute).

I have tried sillm.chat with two different models downloaded from HF via the download.py scripts in the SiLLM-examples repo: Mistral-7B-Instruct-v0.2 and Qwen1.5-7B-Chat; a Llama 3 model that I downloaded directly from HF exhibited the same behavior.

Machine specs: MacBook Air M2 with 16 GB of memory on macOS Sonoma 14.4.1, running Python 3.12.3 in a conda environment.

This might be an mlx-lm issue, as mlx_lm.generate is also very slow.
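For reference, raw mlx-lm generation speed can be checked outside of SiLLM with something like the command below; it prints prompt and generation tokens-per-second at the end. The model path here is a placeholder, and the flag names are from my recollection of the mlx-lm CLI, so double-check against python -m mlx_lm.generate --help.

    python -m mlx_lm.generate --model /path/to/model --prompt "Hello" --max-tokens 100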

I can run sillm.chat with Llama 3 8B at 10.79 tok/sec on a 2020 MacBook Air M1 with 16 GB RAM. I'm using Python 3.11.

I suspect inference is memory-constrained in this case, if you're trying to run the full 7B and 8B models. Without quantization, Llama-3 8B takes 15,316 MB of memory on my Mac Studio before the chat even starts. This means your 16 GB MacBook Air starts swapping memory to disk, and the speed drops significantly.
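A back-of-the-envelope check on the weights alone (rounded, ignoring the KV cache and runtime overhead):

    8B parameters × 2 bytes (fp16)  ≈ 16 GB
    8B parameters × 1 byte (8-bit)  ≈ 8 GB
    8B parameters × 0.5 bytes (4-bit) ≈ 4 GB

So an unquantized 8B model plus macOS and the KV cache can't sit comfortably in 16 GB of unified memory, while the 8-bit and 4-bit versions leave plenty of headroom.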

Try quantizing the model (argument -q4 or -q8) when running sillm.chat. On my MacBook Air M2 16GB (sounds like the same config) I'm getting 9.20 tok/sec with Llama-3-8B-Instruct quantized to 8-bit, using under 8 GB of memory.
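For example, something like the following, where the model path is a placeholder for wherever you stored the downloaded weights:

    python -m sillm.chat /path/to/Meta-Llama-3-8B-Instruct -q4
    python -m sillm.chat /path/to/Meta-Llama-3-8B-Instruct -q8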

FYI, the reference video with the MacBook Air uses the Gemma-2B-it model, which is a small, fast model; larger models (7B and 8B) will run slower.

Seems to be it; quantized to 4-bit, performance was reasonable.

Duh, I can't believe I forgot to mention my Llama 3 8B was 4-bit quantized...