damian0815 / compel

A prompting enhancement library for transformers-type text embedding systems

Possibly GPU memory leak?

kshieh1 opened this issue · comments

Hi,

I found a GPU out-of-memory (OOM) error when using compel in my project. I made a shorter test program out of your compel-demo.py:

import torch
from compel import Compel
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
from torch import Generator

device = "cuda"
pipeline = StableDiffusionPipeline.from_pretrained("dreamlike-art/dreamlike-photoreal-2.0",
                                                   torch_dtype=torch.float16).to(device)
# dpm++
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config,
                                                             algorithm_type="dpmsolver++")

COMPEL = True
compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)

i = 0
while True:
    prompts = ["a cat playing with a ball++ in the forest", "a cat playing with a ball in the forest"]

    if COMPEL:
        prompt_embeds = torch.cat([compel.build_conditioning_tensor(prompt) for prompt in prompts])
        images = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=10, width=256, height=256).images
        #del prompt_embeds # not helping
    else:
        images = pipeline(prompt=prompts, num_inference_steps=10, width=256, height=256).images
    i += 1
    print(i, images)

    images[0].save('img0.jpg')
    images[1].save('img1.jpg')

Tested on an Nvidia RTX 3050 Ti Mobile GPU with 4 GB VRAM; an OOM exception occurs after 10~20 iterations. No OOM if COMPEL = False.

hmm, compel is basically stateless, there isn't much that could leak that i have much control over. torch is sometimes poor at cleaning up its caches properly; you might want to try calling torch.cuda.empty_cache() occasionally.
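
Applied to the test program above, that suggestion would look roughly like this (a sketch only, reusing the same variable names):

# inside the while loop, replacing the existing COMPEL branch
prompt_embeds = torch.cat([compel.build_conditioning_tensor(prompt) for prompt in prompts])
images = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=10, width=256, height=256).images
torch.cuda.empty_cache()  # ask torch to release cached, unreferenced CUDA blocks back to the driver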

Thanks. I think I have pushed VRAM usage to the edge -- maybe torch needs some extra room to maneuver...

(Updated Apr. 17) OOM occurs even if only the prompt embeddings are built repeatedly, without running inference (i.e., with images = pipeline(...) commented out). torch.cuda.empty_cache() does not help.

urgh. idk. i also don't have a local gpu to readily debug this. have you tried tearing down the compel instance and making a new one for each prompt?
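
For reference, re-creating the instance each iteration would look roughly like this (a sketch only; see the follow-up below for whether it helps):

while True:
    # fresh Compel instance per iteration, dropped again at the end of the loop body
    compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)
    prompt_embeds = torch.cat([compel.build_conditioning_tensor(prompt) for prompt in prompts])
    images = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=10, width=256, height=256).images
    del compel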

Interesting. I ran the same test on Google Colab (GPU with 12 GB VRAM) and no OOM issue occurred. Then I updated my local environment to the exact same package versions (torch, diffusers, compel, etc.) as the Colab one, but the OOM issue still occurs. Local tests were on Nvidia GPUs with 4 GB and 8 GB VRAM, btw.

Initializing and deleting the compel instance inside the loop doesn't help, fyi.

@kshieh1 Did you ever figure out a solution to this? I'm also hitting my 6 GB limit as soon as I use the compel embeddings.

No luck so far

I think I have come up with a solution. After image generation, you should explicitly de-reference the tensor object (i.e., prompt_embeds = None) and call gc.collect().
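
In the test program above, that looks roughly like this (a sketch, same names, plus the gc import):

import gc

while True:
    # prompts defined as in the test program above
    prompt_embeds = torch.cat([compel.build_conditioning_tensor(prompt) for prompt in prompts])
    images = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=10, width=256, height=256).images
    # drop the last reference to the conditioning tensor, then force a collection
    # so the underlying CUDA allocation can actually be freed
    prompt_embeds = None
    gc.collect()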

ahh nice. i'll add a note on the readme for the next version. thanks for sharing your solution!

The readme has been updated.

@kshieh1 we encountered a possibly related (possibly the same?) problem in InvokeAI, which was resolved by wrapping the calls to Compel in a with torch.no_grad(): block. did you try this?
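
In the test program above, that would look roughly like this (a sketch, same variable names):

with torch.no_grad():
    # build the conditioning tensors without recording an autograd graph
    prompt_embeds = torch.cat([compel.build_conditioning_tensor(prompt) for prompt in prompts])
images = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=10, width=256, height=256).images

That would also be consistent with the leak: outside no_grad, the text encoder call records an autograd graph, and holding a reference to its output keeps those intermediate activations alive in VRAM.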

Yeah, I just did a quick test and found that the amount of allocated CUDA memory stays stable -- I think I can get rid of those costly gc.collect() calls in my code.

Thanks for sharing.