python_tracer_level=1 causes OOM very quickly
dobos opened this issue · comments
I'm trying to profile a piece of code written in eager TensorFlow. When I turn Python tracing on with the python_tracer_level=1 switch, it very quickly eats up the GPU memory; usage grows roughly linearly with time. I don't know whether the profiler actually uses GPU RAM to store the trace or whether it interferes with the garbage collector. It happens with or without tf.config.experimental.set_memory_growth(gpu, True).
Versions:
tensorflow 2.4.1 gpu_py39h8236f22_0
tensorflow-base 2.4.1 gpu_py39h29c2da4_0
tensorflow-estimator 2.4.1 pyheb71bc4_0
tensorflow-gpu 2.4.1 h30adc30_0
tensorflow-probability 0.12.1 pyhd8ed1ab_0 conda-forge
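For reference, this is roughly how I'm enabling the tracer. A possible workaround is to profile with the Python tracer disabled (python_tracer_level=0) while keeping host and device tracing, using the tf.profiler.experimental.ProfilerOptions API (the logdir path here is just a placeholder):

```python
import tensorflow as tf

# Workaround sketch: keep host/device tracing but turn the Python tracer
# off (python_tracer_level=0), since level 1 is what triggers the leak.
options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=2,    # trace TF runtime and op execution
    python_tracer_level=0,  # skip Python stack tracing
    device_tracer_level=1,  # keep GPU kernel tracing
)

tf.profiler.experimental.start("/tmp/tf_profile", options=options)
# ... run the eager code being profiled ...
tf.profiler.experimental.stop()
```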
Apparently we were holding references to PyCodeObject during profiling, and some optimization logic also kept PyFrameObject instances alive, which means some local variables were not freed until the profiler ended. If those variables are bound to GPU memory, that memory leaks.
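The mechanism can be demonstrated with plain CPython, no TensorFlow needed: holding a frame object keeps that frame's locals alive after the function returns. Here `BigBuffer` and `held_frames` are hypothetical stand-ins for a GPU-backed tensor and the profiler's retained frame references:

```python
import gc
import sys
import weakref

class BigBuffer:
    """Stand-in for a GPU-backed tensor (hypothetical)."""

held_frames = []  # simulates the profiler keeping PyFrameObject references

def work():
    buf = BigBuffer()                    # local bound to "GPU memory"
    held_frames.append(sys._getframe())  # profiler-style frame capture
    return weakref.ref(buf)

ref = work()
gc.collect()
print(ref() is not None)  # True: the held frame keeps 'buf' alive

held_frames.clear()       # drop the frames, as the fix does
gc.collect()
print(ref() is None)      # True: 'buf' is finally collected
```

This is the same reason long-lived tracebacks leak locals in Python; the fix amounts to not retaining the frames for the lifetime of the profiling session.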
We had an internal report of this issue, and it has been fixed in TF nightly.