tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.

Home Page: https://js.tensorflow.org


Clearing texture cache manually to avoid memory leaks

justadudewhohacks opened this issue

A while ago I filed issue #604 with a description of how variably sized input tensors lead to memory leaks when using the WebGL backend.

@nsthorat answered the following, which makes total sense to me:

Ah yes, this is because we cache textures based on their physical shape, you are basically purposefully getting cache misses every single time. We've found that that's usually pretty rare. Resizing to a fixed input size will absolutely fix the problem :)

My question is whether (and how) it is possible to clear that cache at runtime. The reason I am asking is that face-api.js runs multiple unit tests of models with different input sizes. The problem I am facing is that every time a new input size is used, additional GPU memory is allocated and there seems to be no way to free it (to clear the cache?).

I want to upgrade face-api.js' dependency on tfjs-core from version 0.14.2 to the latest release, but it seems that with 1.0.1 the amount of GPU memory allocated is significantly higher than with 0.14.2, which crashes the WebGL backend when running the unit tests against tfjs-core 1.0.1.

So if there is any way to clear that cache, or if it is possible to implement such a mechanism, I would be thankful to know. Any hints on where to find that caching mechanism in the code base would also be very helpful to me.

Hi @justadudewhohacks!

So there are two reasons why the programs will be slower when you have different-sized inputs:

  1. Texture cache misses because of different texture sizes. Hopefully fixing this issue will mitigate this: #1074
  2. Program recompilations because of different input / output tensor shapes. Hopefully fixing this issue will mitigate this: #1264

cc @EmilyReif, this would be a good model to speed up w.r.t. program recompilation.

In the short term, @annxingyuan, is this something you could look into? Something changed in 1.0 which crashes the face-api unit tests.

Thanks for the quick reply!

Just to clarify, the main issue I am facing is not the slower execution due to cache misses, but rather the memory leak. At some point one simply runs out of GPU memory when using different input sizes, because the allocated memory is never freed.

This is an issue in both 0.14.2 and 1.0.1; it just happens that 1.0.1 hits the memory limit much sooner, though I am not sure why.

@justadudewhohacks Thanks for filing this. I took a look at the earlier issue you mentioned and it looks like you created a standalone HTML page to reproduce: https://github.com/justadudewhohacks/tfjs-tensor-size-memoryleak-issue

I wanted to confirm that this page reproduces the issue here as well?

@justadudewhohacks Hey - one possible explanation for 1.0's higher memory usage is that we've turned on im2col by default for convolutions, which results in potentially very large textures being allocated. You can disable im2col by adding tf.ENV.set('WEBGL_CONV_IM2COL', false); right after you load tfjs - does this alleviate the issue somewhat?
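For reference, a minimal sketch of how that might look in practice (assuming tfjs is pulled in via the @tensorflow/tfjs package; adjust the import to match your setup):

import * as tf from '@tensorflow/tfjs'

// Disable im2col for WebGL convolutions before any model runs, so the
// conv kernels avoid allocating the potentially very large im2col textures.
tf.ENV.set('WEBGL_CONV_IM2COL', false)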

Hi @annxingyuan,

@justadudewhohacks Thanks for filing this. I took a look at the earlier issue you mentioned and it looks like you created a standalone HTML page to reproduce: https://github.com/justadudewhohacks/tfjs-tensor-size-memoryleak-issue

I wanted to confirm that this page reproduces the issue here as well?

Yes, I just checked this demo again. After updating tfjs-core to the latest version (1.0.2), you can still reproduce the issue with the demo. You should see the GPU memory grow after clicking the RUN button a few times:
[screenshot: GPU memory usage growing across runs]


@justadudewhohacks Hey - one possible explanation for 1.0's higher memory usage is that we've turned on im2col by default for convolutions, which results in potentially very large textures being allocated. You can disable im2col by adding tf.ENV.set('WEBGL_CONV_IM2COL', false); right after you load tfjs - does this alleviate the issue somewhat?

Good catch, thanks! Indeed, setting tf.ENV.set('WEBGL_CONV_IM2COL', false); helps. With this feature disabled, the unit tests end up consuming ~1.2 GB of GPU memory, whereas with it enabled (the default) they consume ~1.8 GB, sometimes crashing the unit tests with WebGL exceptions on my system.

Hey @justadudewhohacks - as a stopgap measure just so you can get your unit tests to pass, you could manually create a backend in your describe block, and then call backend.dispose() (which will call WebGLContext.deleteTexture on each texture that's been created). This is essentially the same as clearing the texture cache manually.

We manually create backends as part of our unit tests - you can refer to https://github.com/tensorflow/tfjs-core/blob/master/src/kernels/backend_cpu_test.ts#L33 for an example.
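Roughly, that might look like the following sketch (the backend name 'test-webgl' is just a placeholder, and the exact registration calls may differ slightly between tfjs-core versions):

describe('face-api model tests', () => {
  let backend

  beforeAll(() => {
    // Create a dedicated WebGL backend for this suite and make it the active one.
    backend = new tf.webgl.MathBackendWebGL()
    tf.registerBackend('test-webgl', () => backend)
    tf.setBackend('test-webgl')
  })

  afterAll(() => {
    // dispose() deletes every WebGL texture this backend created,
    // which is effectively a manual texture-cache clear.
    backend.dispose()
  })

  it('runs inference at varying input sizes', async () => {
    // ... load a model and run predictions here ...
  })
})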

Hopefully this helps!

@nsthorat I'm going to close this issue - I let @justadudewhohacks know about a workaround for manually clearing the texture cache, and we have several open issues tracking improvements to memory management:

#1250
#1074
#1264

@justadudewhohacks - hopefully you're no longer blocked on upgrading to 1.0!

@annxingyuan, thanks for the tip about creating our own backends for the unit tests. I just tried it out, and it looks like backend.dispose() is exactly what I was looking for!

@annxingyuan, one additional question: does disposing a newly created WebGL backend have side effects?

Simply doing the following:

const backend = new tf.webgl.MathBackendWebGL()
backend.dispose()

This causes any further prediction to fail with the following error:

[.WebGL-0000000004AB9D10]GL ERROR :GL_INVALID_OPERATION : glDrawArrays: bound to target 0x8893 : no buffer

Not exactly sure which operations are causing this error though; I will have to investigate this.

When you do that, I think there is no active backend. Are you registering and setting a new backend after you dispose the old one?

@nsthorat, just executing these two lines of code (creating a new WebGL backend and immediately disposing it again) apparently crashes the backend that is registered by default. I am not touching the default backend.

Maybe I am missing something, but I assumed that creating a new backend like this would leave the old backend untouched.

Hey @justadudewhohacks - you're right - we share WebGL contexts between backends (for testing efficiency) so if you call dispose on a new backend it may cause issues.

Like Nikhil mentioned, if you create a new backend after disposing the previous one, your issue should go away.

e.g.

let backend = new tf.webgl.MathBackendWebGL()
backend.dispose()
backend = new tf.webgl.MathBackendWebGL()

This is admittedly somewhat surprising behavior :) But hopefully you don't have a use case for creating then disposing a backend without creating a new backend afterwards.

Just to follow up on why we did it this way: backends are global singletons. If you kill a backend, you need to set a new one, since we don't have any mechanism for a disposed backend to tell the engine that it's dead.

Most of the time users don't see this stuff anyway, so that's why it is the way it is :)
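For completeness, a sketch of that replace-the-backend pattern (the name 'fresh-webgl' is just a placeholder):

let backend = new tf.webgl.MathBackendWebGL()
backend.dispose()
// The engine has no way to know the old backend is dead, so create a
// replacement, register it, and make it the active backend.
backend = new tf.webgl.MathBackendWebGL()
tf.registerBackend('fresh-webgl', () => backend)
tf.setBackend('fresh-webgl')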

Ahh okay. So basically calling new tf.webgl.MathBackendWebGL() after every backend.dispose() seemed to fix it. Kind of confusing but it works :D

Just figured out that creating the new backends with the gpgpu context of the initial backend seems to work as well:

const gpgpuContext = tf.ENV.backend['gpgpu'] // grab the GPGPU context of the currently active backend
new tf.webgl.MathBackendWebGL(gpgpuContext) // build the new backend on top of that shared context

Thanks again for your help!

If you want to run unit tests efficiently, why don't you try using our describeWithFlags utility that lives in jasmine_util.ts?
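For example, a rough sketch of what that might look like (assuming describeWithFlags and a WebGL constraint such as WEBGL_ENVS are importable from tfjs-core's jasmine_util; the exact constant names and import path vary between versions):

import {describeWithFlags, WEBGL_ENVS} from '@tensorflow/tfjs-core/dist/jasmine_util'

// Runs the enclosed specs once per matching test environment and lets the
// harness handle backend setup and teardown between suites.
describeWithFlags('face-api models', WEBGL_ENVS, () => {
  it('runs inference at a fixed input size', async () => {
    // ... load a model and run a prediction here ...
  })
})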