ml-explore / mlx-swift-examples

Examples using MLX Swift

GPU Memory/Cache Limit

r4ghu opened this issue · comments

Hi,
First of all, thank you for providing the Swift example implementations, which helped me understand many components inside MLX. I am implementing inference for a quantized Flan-T5 model and have a working solution.

I am currently trying to optimize the model's GPU usage during inference so I can integrate it into other applications. I am a bit new to this area and am trying to understand the difference between the following: MLX.GPU.set(memoryLimit:), MLX.GPU.set(cacheLimit:), and the GPU usage stats reported by asitop or Activity Monitor.

I expected lower GPU usage after setting the memory limit to 200 MB (i.e. GPU.set(memoryLimit: 200 * 1024 * 1024, relaxed: true)), but irrespective of the memory limit I set, I see >90% GPU usage all the time.
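For reference, here is a minimal sketch of the calls in question (the sizes are placeholders, and GPU.snapshot() is assumed to be available in the installed MLX Swift version for reading MLX's own memory counters):

```swift
import MLX

// Cap the buffer cache MLX keeps around for reuse (placeholder size).
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)

// Cap overall allocation; relaxed: true allows allocation beyond the cap when needed.
MLX.GPU.set(memoryLimit: 200 * 1024 * 1024, relaxed: true)

// GPU.snapshot() (assumed available) reports MLX's own memory counters --
// these are what the limits act on, not the "GPU Usage" percentage shown
// by asitop or Activity Monitor.
let snapshot = MLX.GPU.snapshot()
print("active: \(snapshot.activeMemory), cache: \(snapshot.cacheMemory), peak: \(snapshot.peakMemory)")
```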

I also observed the following:

  • I can set cacheLimit to any value (from 0 up to the maximum GPU memory limit) and observe inference times change accordingly (from slow to fast), with no change in GPU usage.
  • I could only set memoryLimit to values greater than the size of my model. For any value smaller than the model size, the program halts during inference and never resumes. I expected MLX to wait internally and resume once the current ops queue finished, but that was not the case.

My goal is to lower the overall GPU usage (and hopefully the corresponding power usage) during inference, and I am fine with slower inference to achieve this. I would appreciate any suggestions on how to achieve this.

I think the GPU Usage metric you are looking at is akin to CPU Usage -- it measures how much activity there is, not how much memory is in use. Memory is unified between the CPU and GPU, so there isn't a specific metric for GPU memory.

I think you might be well served with set(cacheLimit:) -- this controls the amount of memory that MLX will keep around after it is used (so that it can be reused without reallocating).

If you set this to 0, then when a buffer is released the backing memory will also be released. This is the minimum amount of memory you can use without loading and unloading parts of your model. Performance may suffer, but in my experience it actually isn't too bad.

Then you can experiment by setting it larger and larger. Try 64k, 256k, 1M, 256M, 1G, etc. It all depends on how much memory your inference takes (which is typically related to the number of tokens in the cache). Anyway, you may find that there is a smallish value that gives good performance and keeps the memory use under control.
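Roughly, that sweep could look like the sketch below (runInference() is just a stand-in workload for your actual Flan-T5 generation loop, and the limit values are arbitrary):

```swift
import Foundation
import MLX

// Stand-in workload so the sketch is self-contained -- replace with your
// quantized Flan-T5 generation loop.
func runInference() {
    let x = MLXArray(Array(repeating: Float(1), count: 256 * 256), [256, 256])
    MLX.eval(MLX.matmul(x, x))
}

// Try progressively larger cache limits and time one pass at each setting.
let candidateLimits = [0, 64 * 1024, 256 * 1024, 1 << 20, 256 << 20, 1 << 30]

for limit in candidateLimits {
    MLX.GPU.set(cacheLimit: limit)

    let start = Date()
    runInference()
    let elapsed = Date().timeIntervalSince(start)
    print("cacheLimit \(limit) bytes -> \(elapsed) s")
}
```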

If you are curious, there are some more details in #17

Thanks @davidkoski for the quick response!

If I understand this correctly, I should use cacheLimit to control memory and not rely on the GPU usage metrics.

I can confirm that setting the cacheLimit to 0 didn't affect the performance too badly.