tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.

Home Page: https://js.tensorflow.org

Tensors are leaked when `model.save()` includes the optimizer

Vectorrent opened this issue

commented

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow.js): False
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
  • TensorFlow.js installed from (npm or script link): NPM
  • TensorFlow.js version (use command below): 4.17.0

Describe the current behavior
When using @tensorflow/tfjs-node-gpu for training, I periodically save models to disk. However, my training has been crashing, and I've just learned why:

When model.save() includes the optimizer, a single tensor is leaked on every call. This leads to a slow accumulation of unneeded tensors, and eventually crashes my computer:

await model.save(`file://saved_model`, { includeOptimizer: true })

To be clear, this is before saving a model:

{ unreliable: true, numTensors: 18, numDataBuffers: 18, numBytes: 420 }

And this is after:

{ unreliable: true, numTensors: 19, numDataBuffers: 19, numBytes: 424 }

Describe the expected behavior
I would expect model saving to dispose of all of its intermediate tensors once the operation is complete.
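
In other words, I'd expect this invariant to hold (a minimal check, using the same setup as the repro below):

const before = tf.memory().numTensors
await model.save(`file://saved_model`, { includeOptimizer: true })
console.assert(tf.memory().numTensors === before, 'model.save() leaked tensors')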

Standalone code to reproduce the issue
This bug is 100% reproducible in both tfjs-node and tfjs-node-gpu:

import fs from 'fs'
import * as tf from '@tensorflow/tfjs-node'

const model = tf.sequential()
model.add(tf.layers.dense({ units: 10, inputShape: [1] }))
model.add(tf.layers.dense({ units: 1 }))

model.compile({
    optimizer: 'adam',
    loss: 'meanSquaredError'
})

const xs = tf.tensor2d([1, 2, 3, 4], [4, 1])
const ys = tf.tensor2d([2, 4, 6, 8], [4, 1])

fs.mkdirSync('./saved_model', { recursive: true })

// Train indefinitely; every 1000 epochs, save the model and watch
// tf.memory().numTensors climb by one per save.
model.fit(xs, ys, {
    epochs: Infinity,
    verbose: 0,
    callbacks: {
        onEpochEnd: async (epoch, logs) => {
            console.clear()
            console.log(epoch)
            console.log(tf.memory())
            if (epoch % 1000 === 0 && epoch !== 0) {
                await model.save(`file://saved_model`, {
                    includeOptimizer: true
                })
            }
        }
    }
})

Other info / logs

  • There are no logs to provide, because TFJS OOM issues cause my computer to hard-freeze; they require a forcible shutdown to recover from.
  • If the includeOptimizer flag is disabled, then this does not occur.
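
For reference, the same call with the flag disabled does not leak:

// Saving without the optimizer state does not leak a tensor, but Adam's
// moments and iteration count are not written to disk.
await model.save(`file://saved_model`, { includeOptimizer: false })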

Hi, @Vectorrent

Thank you for bringing this issue to our attention. I tried to replicate the behaviour on my end on macOS and I get the output below with the includeOptimizer: true flag; as you mentioned, the issue does not happen with includeOptimizer: false, which matches what I observed. As a workaround, you can disable the includeOptimizer flag when saving the model. This avoids saving the optimizer state and prevents the leak, but you will then need to recreate the optimizer when loading the model. Alternatively, TensorFlow.js provides functions for manual memory management; you can try the following approach after each save. Please refer to the official documentation for tf.tidy and tf.dispose.

await model.save(`file://saved_model`, { includeOptimizer: true });

// Manually dispose of the optimizer
model.optimizer.dispose();

// Dispose of other unused tensors
tf.dispose(xs);
tf.dispose(ys);
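
For reference, tf.tidy() only reclaims tensors created synchronously inside its callback (it does not accept an async function, so it cannot wrap model.save() itself). A minimal example:

// Intermediate tensors created inside the callback are disposed automatically;
// the returned tensor is kept alive and must be disposed manually.
const prediction = tf.tidy(() => model.predict(tf.tensor2d([[5]])));
prediction.dispose();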

[screenshot: tf.memory() output after saving with includeOptimizer: true]

Please let me know if I have missed anything here. Thank you for your cooperation and patience.

commented

Thanks for the quick response. Sadly, tf.tidy() has no effect, and tf.dispose() crashes my training session (for obvious reasons). So neither of these is a "solution"; we should probably fix the underlying bug in the library. I might have some time to dig into the TFJS code and troubleshoot that at some point.

Until then, my solution is to 1) create a manual training loop, 2) save the model, 3) unload the model, 4) re-load the model, 5) resume training. Not a great solution, if you ask me 🤣
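
Roughly, that workaround looks like the sketch below. The paths, step count, and save interval are just illustrative, it assumes a model has already been saved to ./saved_model once, and I haven't verified whether the optimizer state survives the round trip:

import * as tf from '@tensorflow/tfjs-node'

// Sketch of the save/unload/reload workaround. Disposing the whole model
// also releases whatever the save leaked, so numTensors stays flat.
const loadAndCompile = async () => {
    const m = await tf.loadLayersModel('file://saved_model/model.json')
    // Recompiling rebuilds the optimizer from scratch; its state is reset.
    m.compile({ optimizer: 'adam', loss: 'meanSquaredError' })
    return m
}

const xs = tf.tensor2d([1, 2, 3, 4], [4, 1])
const ys = tf.tensor2d([2, 4, 6, 8], [4, 1])

let model = await loadAndCompile()
for (let step = 1; step <= 100000; step++) {
    await model.fit(xs, ys, { epochs: 1, verbose: 0 })
    if (step % 1000 === 0) {
        await model.save('file://saved_model', { includeOptimizer: true })
        model.dispose()                 // unload: frees the leaked tensor too
        model = await loadAndCompile()  // re-load and resume training
    }
}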

commented

I cannot for the life of me figure out how to build TFJS locally on my computer, so I'm not really able to debug or test this properly. Regardless, I've been digging, and this is probably where we need to apply a fix:
https://github.com/tensorflow/tfjs/blob/master/tfjs-layers/src/engine/training.ts#L2146

If I had to guess, maybe it's related to the use of io.concatenateArrayBuffers here? Apparently, it's deprecated and we should be using tf.io.CompositeArrayBuffer.join() instead.
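
If that guess is right, the fix would be roughly a one-line swap like the sketch below (the variable names are made up, not the actual code in training.ts, and I haven't been able to build and verify whether this actually stops the leak):

// Hypothetical sketch only; 'buffers' stands in for whatever the real code
// passes to the deprecated helper at that line.
// const weightData = tf.io.concatenateArrayBuffers(buffers)  // deprecated
const weightData = tf.io.CompositeArrayBuffer.join(buffers)    // replacement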