ml-explore / mlx-swift-examples

Examples using MLX Swift

LLMEval: Memory usage

ptliddle opened this issue:

I'm not sure if I should ask this here or in the mlx-swift repo. Let me know if I should move it.

I'm trying to understand the memory usage. I have a 4-billion-parameter, quantized (4-bit) Qwen2 model that I'm using for inference with the code in LLMEval. Setting breakpoints in the code and tracking memory usage, it seems that just after load it uses around ~500 MB. After the first inference this balloons to over 10 GB (~10.3 GB) and doesn't come back down even after inference is complete.

Can someone explain why this is? Is it just a case of the model loading things it needs lazily? If so, is there a way to reset this so the model can drop back to its pre-inference, post-weight-load size, to reduce the memory requirements of the app while the LLM is sitting idle?

I can explain most/some of it.

Right after you load the weights you are using ~500 MB. This is easily explained, since the safetensors file is around that size: https://huggingface.co/mlx-community/Qwen1.5-0.5B-Chat-4bit/tree/main

OK, now to the mysterious part -- when you evaluate the model it needs buffers for the MLXArray intermediates, the result, and the cache (cache: [(MLXArray, MLXArray)]?). For the first token generated it might need, say, 1 MB. The result and the cache are still pretty small, let's say 10 KB. The rest are intermediates that are recycled once the result is computed.

The key here is that buffers which are no longer being used are not deallocated, they are recycled -- specifically, they are put into a pool where they can be reused by later computations. Allocating memory and creating the data structures needed to pass it to Metal isn't free, and often buffers of the same size are needed for the next evaluation.

See:

OK, so the code goes on to produce token 2. Roughly the same 1 MB of intermediates is produced and we have a new output, this time up to 15 KB. The code was able to reuse a lot of those buffers and recycle them again.

Repeat N times; each time the size of the intermediates grows a little (each token produced makes the arrays one longer, the cache one longer, the result one longer). For previous tokens intermediates of size X were needed and could be reused, but now they are up to 2X or 3X, etc. We can't reuse the old buffers because they are too small, so they sit in the buffer pool waiting to be reused, but the evaluation has moved past them.

By the end of the run (let's say at 500 tokens) you have the ~500 MB of weights, ~20 KB of cache and result (I have no idea if that is the right size, but as an example), and ~9.9 GB of buffers waiting to be reused.

If you run inference again it doesn't need to grow the memory because the buffers are sitting there waiting to be reused.

Now you might ask: why so much? There is a policy (see the reference above) that governs when these buffers should be freed, and it is based on what Metal reports as recommendedMaxWorkingSetSize, which (per what I read) is based on the size of physical memory. If you have a lot of memory, the amount MLX is willing to keep around is higher.

In a recent update (ml-explore/mlx-swift#38) some of this policy is now exposed:
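
Roughly, the controls look like this (my paraphrase of the MLX.GPU API -- see the PR for the exact signatures):

```swift
import MLX

// Limit how much memory MLX keeps in the buffer pool for reuse.
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)                       // ~20 MB

// Limit how much memory MLX is willing to allocate overall; `relaxed: true`
// lets it exceed the limit under pressure rather than fail the allocation.
MLX.GPU.set(memoryLimit: 8 * 1024 * 1024 * 1024, relaxed: true) // ~8 GB
```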

So, for example, in LLMEval:
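
Something along these lines, called once near startup (the 20 MB value is just illustrative -- tune it for your app):

```swift
import MLX

// Set the cache limit before the model is loaded or evaluated,
// so the buffer pool never grows beyond the limit.
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)
```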

With this API you can tune the maximum amount of memory MLX is willing to allocate, and also the amount of memory it is willing to keep around in the buffer pool. Do note that the policy is currently applied on allocation, not when buffers are returned to the pool -- this means you might set the policy to keep 100 MB around but observe that it currently holds 300 MB. That is because the last evaluation produced 200 MB of buffers that were returned to the pool. On the next allocation (for example, if there is a pool miss) it will discard anything over the 100 MB policy.
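
To make that concrete, here is a sketch of what you might observe. This assumes your version exposes a cache-inspection property along the lines of GPU.cacheMemory; the numbers are illustrative:

```swift
import MLX

// Ask MLX to keep at most ~100 MB of recycled buffers around.
MLX.GPU.set(cacheLimit: 100 * 1024 * 1024)

// ... an evaluation runs and returns ~200 MB of intermediates to the pool ...

// Right after that evaluation the pool can still hold more than the limit,
// because the limit is enforced when buffers are allocated, not when they
// are returned. (GPU.cacheMemory is assumed here for inspection.)
print("cache bytes:", MLX.GPU.cacheMemory)   // could report ~300 MB at this point

// The next allocation that misses the pool will trim it back under 100 MB.
```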

Give it a try and feel free to post issues back on the mlx repo (where the underlying API and policies are implemented). I know we would love to hear how people are using this and how it is working for them.

CC: @awni

Was this observation about memory growth made with an older version of LLMEval? It should not grow to 10 GB after #13.

Thanks. This is exactly what I was looking for!

I had taken what was in LLMEval from a few days ago as an example and used it in an app I'm working on. When I set the cache limit it fixed the issue I was seeing, and memory stayed around 2.7 GB instead, which is much more reasonable. This is for the 4B-parameter model I'm using (Qwen1.5-4B-Chat-Q4). I was also able to run a 7B without issue, and I'm about to try a 14B. This is all on a 16 GB M1 Pro.

@awni Yes, this is no longer an issue in the latest version of LLMEval. After I saw @davidkoski's post I pulled the latest changes.

OK to close?