Memory leak due to the copy of Metric objects in Composer's trainer
gregjauvion opened this issue
The function `_compute_and_log_metrics` in `composer.trainer.Trainer` does the following:
- Creates a copy of the metrics (using `copy.deepcopy`)
- Computes the values of the metrics on the copy
It is called 3 times in `composer.trainer.Trainer`:
- At every batch in the training loop, to compute the metrics on that batch
- At the end of the epoch
- At the end of the evaluation loop, to compute the metrics on the evaluation dataset
In a specific use case I'm working on, where the `Metric` objects have large states stored on the GPU, I notice that the `Metric` objects instantiated at every epoch in the training loop and in the evaluation loop are not deleted after training and evaluation, probably because of a reference cycle. This memory leak does not occur when I comment out the line `metrics = deepcopy(metrics)`. I don't know whether this happens with all implementations of `Metric` or whether it is specific to the `Metric` I have implemented.
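To illustrate the reference-cycle hypothesis, here is a minimal, self-contained sketch. The `Metric` class below is a toy stand-in (not the real torchmetrics/Composer `Metric`), and the self-reference is an assumed stand-in for whatever cycle the real objects carry: an object kept alive only by a cycle is not freed by reference counting when the last external reference is dropped, and survives until the cycle collector runs.

```python
import copy
import gc
import weakref

class Metric:
    """Toy stand-in for a Metric with a large state (hypothetical class,
    not the real torchmetrics/Composer Metric)."""
    def __init__(self):
        self.state = [0.0] * 100_000  # large buffer standing in for GPU state
        self._parent = self           # assumed self-reference, creating a cycle

gc.disable()                          # make collection deterministic for this demo

m = Metric()
m_copy = copy.deepcopy(m)             # the deep copy carries its own cycle
ref = weakref.ref(m_copy)

del m_copy                            # refcounting alone cannot free it...
assert ref() is not None              # ...because the cycle keeps it alive

gc.collect()                          # only a cycle-collector pass frees it
assert ref() is None

gc.enable()
```

With CPU memory this is usually harmless, since the collector eventually runs; with large GPU-backed states, every deep-copied metric can pin device memory until the next collection, which would match the observed growth.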
I don't see the need for copying the metrics before calling `metric.compute()`. In particular, the metrics are reset with `metric.reset()` at the start of training on each batch and at the start of the evaluation loop, which makes the copy useless. Is there a specific reason I am missing?
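The lifecycle the report relies on can be sketched as follows. `RunningMean` is an illustrative metric (names and implementation are assumptions, not Composer's API); the point is that `compute()` reads state without mutating it, and `reset()` clears state before the next batch or epoch, so a defensive deep copy before `compute()` would add nothing.

```python
class RunningMean:
    """Illustrative metric following the update/compute/reset protocol."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        # Accumulate state across batches.
        self.total += value
        self.count += 1

    def compute(self):
        # Read-only: computing the value does not mutate the state.
        return self.total / self.count

    def reset(self):
        # Clear state for the next batch/epoch.
        self.total, self.count = 0.0, 0

m = RunningMean()
for v in (1.0, 2.0, 3.0):
    m.update(v)
print(m.compute())  # prints 2.0; computed directly, no copy needed
m.reset()           # state cleared before the next use
```

Under these assumptions, copying before `compute()` only matters if `compute()` mutates state that must survive the call, which the reset-before-reuse pattern above would make moot anyway.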
The code you linked is in a GitHub repo that has nothing to do with this one. This issue tracker is for the Composer project of the PHP ecosystem.
Yes, sorry about that, I've just realized.