tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.

Home Page: https://burn.dev

Memory leak in Wgpu backend

joshhansen opened this issue

Describe the bug
Memory leaks when using the Wgpu backend, but not when using the NdArray backend.

To Reproduce
A minimal reproduction is available at https://github.com/joshhansen/burnleak.

Check out the repository and run cargo run --release on the master branch. Watch the memory usage: in my case, it climbs steadily and rapidly.

Then check out the master_ndarray branch and run it again. In my case, the memory usage does not climb.

Running on the Wgpu backend but with the WgpuDevice::Cpu device slows the leak without eliminating it.
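For context, the reproduction amounts to a tight allocate-compute-drop loop. The sketch below is only an assumed shape of that loop (the actual code is in the linked burnleak repository), written against the burn 0.13-style API where the device is passed to Tensor::random; it requires the wgpu feature of the burn crate.

```rust
// Minimal sketch of an assumed reproduction loop, not the actual burnleak code.
// With the NdArray backend the process RSS stays flat; with Wgpu it keeps growing.
use burn::backend::wgpu::{Wgpu, WgpuDevice};
use burn::tensor::{Distribution, Tensor};

fn main() {
    let device = WgpuDevice::default();
    loop {
        // Allocate fresh tensors, run a small computation, then drop everything.
        let a = Tensor::<Wgpu, 2>::random([512, 512], Distribution::Default, &device);
        let b = Tensor::<Wgpu, 2>::random([512, 512], Distribution::Default, &device);
        // Reading the data back forces the computation to complete each iteration.
        let _ = a.matmul(b).into_data();
    }
}
```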

Expected behavior
The program should run without a memory leak on the Wgpu backend like it does on the NdArray backend.

Screenshots
n/a

Desktop:

  • OS: Linux Mint 21.3, kernel 6.5.0 x86_64
  • Burn: 0.13.2
  • Rust: 1.76.0
  • Nvidia driver 545.29.06-0ubuntu0.22.04.2

Additional context
Heaptrack memory leak profile: [two heaptrack screenshots attached to the original issue]

I started experiencing the same leak when I upgraded from burn 0.12 to burn 0.13.

I tried your repro repo with burn 0.12 as a dependency and there is no aggressive leak, so it seems like a regression.


This is what I saw with my personal hobby project's monitoring: [memory usage graph attached]

Thanks for the quick repro @kurtlawrence. I can confirm that downgrading to 0.12.1 results in much better memory usage: there are still leaks coming from wgpu, but an order of magnitude less (82.6 MB of leaks across 1.4 GB of heaptrack samples vs. 2.1 GB of leaks across 626 MB of samples).

I've been working on the memory management strategy currently implemented in the wgpu runtime on the master branch. The current approach results in higher average memory usage, which is intentional. The new strategy is designed to be lazier in freeing unused memory and more aggressive in reusing it. For dynamic memory workloads, this can lead to performance improvements of up to 60%.
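To make the trade-off concrete, here is a conceptual sketch of the lazy-free / aggressive-reuse idea (this is not burn's actual allocator, just an illustration of the strategy described above): released buffers are parked in size-keyed free lists instead of being returned to the system, so later allocations of the same size can bypass allocation entirely.

```rust
// Conceptual illustration of a lazy-free, reuse-first pool; not burn's implementation.
use std::collections::HashMap;

struct BufferPool {
    // Buffers are kept around after release, keyed by their size in bytes.
    free: HashMap<usize, Vec<Vec<u8>>>,
}

impl BufferPool {
    fn new() -> Self {
        Self { free: HashMap::new() }
    }

    /// Reuse an existing buffer of the requested size if one is available,
    /// otherwise allocate a new one.
    fn alloc(&mut self, size: usize) -> Vec<u8> {
        self.free
            .get_mut(&size)
            .and_then(|buffers| buffers.pop())
            .unwrap_or_else(|| vec![0u8; size])
    }

    /// Lazy free: instead of deallocating, keep the buffer for later reuse.
    /// This is what drives average memory usage up.
    fn release(&mut self, buffer: Vec<u8>) {
        self.free.entry(buffer.len()).or_default().push(buffer);
    }
}
```

The performance win comes from skipping allocations for recurring tensor shapes; the cost is that memory held in the pool does not shrink unless something explicitly frees it.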

Our assumption is that for most deep learning use cases, the GPU device will be primarily dedicated to the training or inference tasks being performed. Therefore, we've prioritized better performance at the cost of higher average memory usage. While we don't believe this strategy leads to significantly higher peak memory usage or more frequent out-of-memory situations, we recognize that this could be a potential issue.

If average memory usage is a concern, we could consider adding an option for users to configure the memory management behavior in the wgpu runtime.
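Purely as an illustration of what such an option could look like (a hypothetical sketch, not an existing burn API; the names are made up):

```rust
// Hypothetical configuration knob for the wgpu runtime; not an existing burn API.
pub enum MemoryManagementStrategy {
    /// Free unused GPU buffers eagerly: lower average memory, more allocations.
    EagerFree,
    /// Keep unused buffers for reuse: higher average memory, better throughput on
    /// dynamic workloads. An optional cap could bound how much the pool retains.
    LazyReuse { max_pool_bytes: Option<u64> },
}
```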

@kurtlawrence Is it CPU RAM leakage or GPU memory?

CPU RAM is leaking.

@mepatrick73 The issue is not the level of memory usage; it's that memory usage grows without bound. Eventually it would consume all system memory if I didn't terminate the program. I have 128 GB of system RAM and 4 GPUs with 48 GB of VRAM. The task in my case is inference on a small model; each instance consumes 28 MiB of VRAM according to nvidia-smi.