huggingface / swift-coreml-diffusers

Swift app demonstrating Core ML Stable Diffusion

High RAM usage on GPU mode compared to using apple/ml-stable-diffusion CLI tool

svenkevs opened this issue

I noticed that the diffusers app, while running in GPU mode, uses just over 13 GB of RAM when running inference on the non-quantized SDXL 1.0 model. If I use essentially the same settings with Apple's Core ML Stable Diffusion CLI tool (https://github.com/apple/ml-stable-diffusion) on the same model, my system uses just under 8 GB of RAM. The two also produce different images. Hardware: Mac mini M2 Pro, 16 GB RAM, latest macOS 14 public beta.

swift-coreml-diffuser settings:

Positive prompt: a photo of an astronaut dog on mars
Negative prompt: [empty]
Guidance Scale: 7.5
Step count: 20
Preview count: 25
Random seed: 4184258190
Advanced: GPU
Disable Safety Checker: Selected

Command-line invocation with arguments:
swift run StableDiffusionSample "a photo of an astronaut dog on mars" --compute-units cpuAndGPU --step-count 20 --seed 4184258190 --resource-path <path to model> --xl --disable-safety --output-path <path to image folder>

I'm assuming here that selecting GPU in the app is in fact the same as the CLI's cpuAndGPU (given that the CLI has no GPU-only option). Perhaps the difference lies there? In that case, could CPU & GPU mode support be added?
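For context, Core ML exposes compute-unit preferences through `MLModelConfiguration`, and there is no GPU-only option, so a "GPU" setting in a UI presumably maps to `.cpuAndGPU`. A minimal sketch using the standard Core ML API (the model path is a placeholder):

```swift
import CoreML

// Core ML's compute-unit options: .cpuOnly, .cpuAndGPU, .cpuAndNeuralEngine, .all.
// There is no GPU-only option, so a "GPU" toggle in a UI is presumably .cpuAndGPU.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU

// Placeholder path: load any compiled .mlmodelc with that preference.
// let unet = try MLModel(contentsOf: URL(fileURLWithPath: "<path to model>/Unet.mlmodelc"),
//                        configuration: config)
```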

Loading the model in the app for the first time (e.g. the first time after starting the app or after switching models) also takes much longer than loading it from the command line. The app's 13 GB of RAM use leads to a lot of swap-file use on my 16 GB M2 Pro Mac mini, while running the CLI tool does not touch swap, which most likely explains that difference.

Considering the model sizes and RAM usage, it almost looks like the app is loading the model twice? That's pure speculation, though, and I imagine there's plenty of overhead involved. But given that the app itself uses about 40 MB of RAM before a model is loaded, there's a difference of just over 5 GB versus the command-line tool while generating an image (roughly the size of the UNet weights).

I haven't tested non-SDXL models; I might follow up if I find some time for that (at which point I can also compare RAM use when using the Neural Engine).

I'm honestly not sure whether this is a bug or simply caused by different settings/features under the hood that I'm not aware of, but it does affect how usable the software is on machines with less RAM.

Isn't the 13 GB consumed when switching to a second model and running? Each model uses about 9 GB, but both seem to be retained in memory. A simple workaround might be to restart the app when switching models. I haven't observed this in detail, so I apologize if I'm wrong.

No, I specifically started the app fresh before testing several runs without switching the model. Switching between other models also doesn't seem to keep old models loaded, judging by the app's RAM usage. The OS might well cache them, though, so I should clarify that the values above are RAM usage by the process specifically.
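As an aside, the process's own resident memory (as opposed to anything the OS may cache on its behalf) can be read programmatically with the Mach task-info API; a minimal, self-contained sketch independent of the app's code:

```swift
import Foundation

/// Returns this process's resident memory in bytes, or nil on failure.
func residentMemoryBytes() -> UInt64? {
    var info = mach_task_basic_info()
    var count = mach_msg_type_number_t(MemoryLayout<mach_task_basic_info>.size
                                       / MemoryLayout<natural_t>.size)
    let kr = withUnsafeMutablePointer(to: &info) {
        $0.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), $0, &count)
        }
    }
    return kr == KERN_SUCCESS ? info.resident_size : nil
}

// Example: log resident size (in GB) around model loading or image generation.
if let bytes = residentMemoryBytes() {
    print(String(format: "Resident memory: %.2f GB", Double(bytes) / 1_073_741_824))
}
```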

I see, sorry. My quick observation was:

  • SDXL base: 9 GB
  • SDXL base -> SDXL 4.6 MBP: 14 GB

However, I understand that in your case a fresh start does not help.

Don't be sorry, I find your insights very helpful; they let me consider new angles and do some more testing :)

My memory use when the model is loaded but I am not yet generating an image is just over 9 GB.
While generating an image, this rises to just over 13 GB, with peaks just over 14 GB.
After generating an image, when the app is idle, RAM use drops back to just over 13 GB; it doesn't return to the original 9 GB.

Could you maybe see if you have the same pattern?

I also noticed that the peaks, and a sudden spike in memory pressure, coincide with the moments a new preview image is generated (I had it make 5 preview images with a total step count of 25).
[Screenshot 2023-07-30 at 08 18 00]

If I set the preview count to 0, memory use is a bit more stable. I don't think it matters much, but it might be something to consider when your device is running at its limits.

Within each step, from around the 6th step onwards, the app seems to switch heavily between app memory and compressed memory (compressed memory going from a few hundred MB up to 7 GB and back within each step).

Thanks for the detailed report, @svenkevs. I can confirm that the CLI's cpuAndGPU option is the same as GPU in the app.

There is one difference in the default settings between the two: the app uses the dpmpp scheduler, while the CLI defaults to the pndm scheduler. That could explain the different results, but I don't think it should cause the increased memory use.
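To make the two runs directly comparable, one could pin both to the same scheduler. A sketch of doing so when driving the pipeline from apple/ml-stable-diffusion; the property and case names follow that package but may differ slightly between versions:

```swift
import StableDiffusion  // apple/ml-stable-diffusion Swift package

// Sketch: pin both runs to the same scheduler so only memory behaviour differs.
// Property/case names follow apple/ml-stable-diffusion and may vary by version.
var configuration = StableDiffusionPipeline.Configuration(
    prompt: "a photo of an astronaut dog on mars"
)
configuration.stepCount = 20
configuration.seed = 4184258190
configuration.guidanceScale = 7.5
configuration.schedulerType = .dpmSolverMultistepScheduler  // "dpmpp", as in the app
// configuration.schedulerType = .pndmScheduler             // "pndm", the CLI default
```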

Could you please try disabling previews in the app by setting the preview count to 0? Generating previews engages the image decoder, which certainly has a significant memory cost (especially for the large 1024x1024 images this model produces).
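To illustrate why previews matter for memory: each preview requires running the current latents through the VAE decoder. A sketch of decoding a preview only every few steps via the pipeline's progress handler, assuming a progress value that exposes the current step and lazily decoded preview images as in apple/ml-stable-diffusion (names may differ by version; `pipeline` and `configuration` stand for an already-loaded pipeline and the configuration from the previous sketch):

```swift
// Sketch: decode a preview only every few steps, so the (memory-hungry) VAE
// decoder is not engaged on every denoising step. Field names on the progress
// value are assumptions based on apple/ml-stable-diffusion.
// `pipeline` is an already-loaded StableDiffusionPipeline; `configuration` as above.
let previewInterval = 5

let images = try pipeline.generateImages(configuration: configuration) { progress in
    if progress.step % previewInterval == 0 {
        // Accessing the decoded preview is what invokes the image decoder;
        // skipping it on most steps keeps peak memory near the denoising baseline.
        let previews = progress.currentImages
        _ = previews  // hand the preview images off to the UI here
    }
    return true  // continue generation
}
```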

@pcuenca I just did, I think we commented here at the same time. It doesn't make a big difference.


Yes we did :) Did you test with no previews after a fresh app restart?

I just redid it, not just with a fresh app start but with a reboot of my Mac before starting the app (which was already set to the right model, so it loaded it directly). Still the same behaviour:

[Screenshot 2023-07-30 at 08 46 56]
[Screenshot 2023-07-30 at 08 49 00]
[Screenshot 2023-07-30 at 08 47 51]
[Screenshot 2023-07-30 at 08 50 02]
[Screenshot 2023-07-30 at 08 50 47]

That's certainly strange. This is the memory report in Xcode (it matches the one in Activity Monitor, not shown) when running on my system:

[Screenshot 2023-07-30 at 14 37 09]

The first half is with no previews. Memory stays at ~9.5 GB until the process finishes, and then it spikes to ~13 GB while decoding the image. The second half uses 5 previews, and we can clearly see 5 spikes when previews are generated, plus a final one after generation finishes and the final image is decoded. I see the same behaviour running the CLI.

It almost looks as if the VAE were active all the time in your tests, but I don't know what could lead to that. I'm using Xcode 15.0 beta 5 and macOS 14.0 beta (23A5301g).

You could try to reduce memory usage by unloading models that are no longer in use, by inserting a `return true` here: https://github.com/huggingface/swift-coreml-diffusers/blob/main/Diffusion/Common/ModelInfo.swift#L108. When doing that on my system, memory goes to ~8 GB during denoising and ~11 GB while decoding. If you had time to try that and see if you can prevent paging, that'd be really helpful! (See the sketch below.)
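For illustration only, this is the kind of change being suggested; the property name and surrounding logic are placeholders, since the actual code at ModelInfo.swift line 108 may look different:

```swift
// Hypothetical sketch of the suggested tweak in ModelInfo.swift.
// The property name and the replaced logic are placeholders, not the real code.
var reduceMemory: Bool {
    // Unload model resources that are no longer in use, trading longer
    // reloads for a lower peak memory footprint.
    return true
    // ...the original (e.g. platform-dependent) logic would otherwise go here...
}
```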