python_tracer_level=1 causes OOM very quickly
dobos opened this issue · comments
I'm trying to profile a piece of code written in eager TensorFlow. When I turn Python tracing on with the python_tracer_level=1 switch, it very quickly eats up the GPU memory; usage grows roughly linearly with time. I don't know whether the profiler actually uses GPU RAM to store the trace or whether it interferes with the garbage collector. It happens with or without tf.config.experimental.set_memory_growth(gpu, True).
Versions:
tensorflow 2.4.1 gpu_py39h8236f22_0
tensorflow-base 2.4.1 gpu_py39h29c2da4_0
tensorflow-estimator 2.4.1 pyheb71bc4_0
tensorflow-gpu 2.4.1 h30adc30_0
tensorflow-probability 0.12.1 pyhd8ed1ab_0 conda-forge
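For reference, this is roughly how I'm enabling the tracer. A possible workaround is to profile with the Python tracer disabled (python_tracer_level=0) while keeping host and device tracing, using the tf.profiler.experimental.ProfilerOptions API (the logdir path here is just a placeholder):

```python
import tensorflow as tf

# Workaround sketch: keep host/device tracing but turn the Python tracer
# off (python_tracer_level=0), since level 1 is what triggers the leak.
options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=2,    # trace TF runtime and op execution
    python_tracer_level=0,  # skip Python stack tracing
    device_tracer_level=1,  # keep GPU kernel tracing
)

tf.profiler.experimental.start("/tmp/tf_profile", options=options)
# ... run the eager code being profiled ...
tf.profiler.experimental.stop()
```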
Apparently we were holding references to PyCodeObject during profiling, and some optimization logic also kept PyFrameObject instances alive, which means some local variables were not freed until the profiler ended. If those variables are bound to GPU memory, that memory leaks.
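The mechanism can be demonstrated with plain CPython, no TensorFlow needed: holding a frame object keeps that frame's locals alive after the function returns. Here `BigBuffer` and `held_frames` are hypothetical stand-ins for a GPU-backed tensor and the profiler's retained frame references:

```python
import gc
import sys
import weakref

class BigBuffer:
    """Stand-in for a GPU-backed tensor (hypothetical)."""

held_frames = []  # simulates the profiler keeping PyFrameObject references

def work():
    buf = BigBuffer()                    # local bound to "GPU memory"
    held_frames.append(sys._getframe())  # profiler-style frame capture
    return weakref.ref(buf)

ref = work()
gc.collect()
print(ref() is not None)  # True: the held frame keeps 'buf' alive

held_frames.clear()       # drop the frames, as the fix does
gc.collect()
print(ref() is None)      # True: 'buf' is finally collected
```

This is the same reason long-lived tracebacks leak locals in Python; the fix amounts to not retaining the frames for the lifetime of the profiling session.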
We had an internal report of this issue, and it has been fixed in TF nightly.