armbues / SiLLM

SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.

Slowness of sillm.chat on M2 Air with 16 GB RAM

kylewadegrove opened this issue

Any invocation of python -m sillm.chat model seems much slower on my machine than in the reference video: it takes more than a minute to get to the prompt, and the response generates at maybe 1-2 TPM (tokens per minute).

I have tried sillm.chat with two different models downloaded from HF via the download.py scripts in the SiLLM-examples repo: Mistral-7B-Instruct-v0.2 and Qwen1.5-7B-Chat; a Llama 3 model that I downloaded directly from HF exhibited the same behavior.

Machine specs: MacBook Air M2 with 16 GB of memory on macOS Sonoma 14.4.1, running Python 3.12.3 in a conda environment.

This might be an mlx-lm issue, as mlx_lm.generate is also very slow.
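For reference, raw mlx-lm generation speed can be checked outside of SiLLM with something like the command below; it prints prompt and generation tokens-per-second at the end. The model path here is a placeholder, and the flag names are from my recollection of the mlx-lm CLI, so double-check against python -m mlx_lm.generate --help.

    python -m mlx_lm.generate --model /path/to/model --prompt "Hello" --max-tokens 100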

I can run sillm.chat with Llama 3 8B at 10.79 tok/sec on a 2020 MacBook Air M1 with 16 GB RAM. I'm using Python 3.11.

I suspect inference is memory-constrained in this case, if you're trying to run the full 7B and 8B models. Without quantization, Llama-3 8B takes 15,316 MB of memory on my Mac Studio before the chat even starts. This means your 16 GB MacBook Air starts swapping memory to disk, and the speed drops significantly.
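A back-of-the-envelope check on the weights alone (rounded, ignoring the KV cache and runtime overhead):

    8B parameters × 2 bytes (fp16)  ≈ 16 GB
    8B parameters × 1 byte (8-bit)  ≈ 8 GB
    8B parameters × 0.5 bytes (4-bit) ≈ 4 GB

So an unquantized 8B model plus macOS and the KV cache can't sit comfortably in 16 GB of unified memory, while the 8-bit and 4-bit versions leave plenty of headroom.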

Try quantizing the model (argument -q4 or -q8) when running sillm.chat. On my MacBook Air M2 16GB (sounds like the same config) I'm getting 9.20 tok/sec with Llama-3-8B-Instruct quantized to 8-bit, using under 8 GB of memory.
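For example, something like the following, where the model path is a placeholder for wherever you stored the downloaded weights:

    python -m sillm.chat /path/to/Meta-Llama-3-8B-Instruct -q4
    python -m sillm.chat /path/to/Meta-Llama-3-8B-Instruct -q8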

FYI, the reference video with the MacBook Air uses the Gemma-2B-it model, which is a small, fast model; larger models (7B and 8B) will run slower.

Seems to be it; quantized to 4-bit, performance was reasonable.

Duh, I can't believe I forgot to mention my Llama 3 8B was 4-bit quantized...