tensorflow / profiler

A profiling and performance analysis tool for TensorFlow

python_tracer_level=1 causes OOM very quickly

dobos opened this issue · comments

I'm trying to profile a piece of code written in eager TensorFlow. When I turn Python tracing on with the python_tracer_level=1 switch, it very quickly eats up the GPU memory; usage grows roughly linearly with time. I don't know whether the profiler actually uses GPU RAM to store the trace or whether it interferes with the garbage collector. It happens with or without tf.config.experimental.set_memory_growth(gpu, True).

Versions:

tensorflow                2.4.1           gpu_py39h8236f22_0  
tensorflow-base           2.4.1           gpu_py39h29c2da4_0  
tensorflow-estimator      2.4.1              pyheb71bc4_0  
tensorflow-gpu            2.4.1                h30adc30_0  
tensorflow-probability    0.12.1             pyhd8ed1ab_0    conda-forge
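For context, a minimal sketch of how the Python tracer is enabled via the TF 2.x profiler API (the logdir path is illustrative; the options shown are `tf.profiler.experimental.ProfilerOptions` parameters):

```python
import tensorflow as tf

# Enable the Python tracer alongside the default host and device tracers.
# python_tracer_level=1 is the switch that triggers the reported leak.
options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=2,
    python_tracer_level=1,
    device_tracer_level=1,
)

tf.profiler.experimental.start("/tmp/tf_profile", options=options)
# ... eager TensorFlow code to profile ...
tf.profiler.experimental.stop()
```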

Apparently we were holding references to PyCodeObject during profiling, and some optimization logic keeps PyFrameObject instances alive, which means some local variables are not freed until the profiler ends. If those variables are bound to GPU memory, this shows up as a leak.
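The mechanism can be reproduced in plain Python: holding a reference to a frame object (as a tracer might) keeps that frame's locals alive after the function returns. `FakeTensor` is a hypothetical stand-in for a GPU-backed tensor:

```python
import gc
import sys

class FakeTensor:
    """Stand-in for a GPU-backed tensor; tracks liveness in a class-level set."""
    live = set()
    def __init__(self, name):
        self.name = name
        FakeTensor.live.add(name)
    def __del__(self):
        FakeTensor.live.discard(self.name)

captured_frames = []  # plays the role of the profiler's held PyFrameObject references

def train_step():
    activation = FakeTensor("activation")  # local bound to "device memory"
    captured_frames.append(sys._getframe())  # tracer-style frame capture
    # the function returns, but the frame (and its locals) stays referenced

train_step()
gc.collect()
leaked_while_held = "activation" in FakeTensor.live  # frame pins the local

captured_frames.clear()  # profiler stops; frames are released
gc.collect()
leaked_after_stop = "activation" in FakeTensor.live  # memory is freed
```

Releasing the captured frames frees the local, which mirrors why the leak only persists until the profiler ends.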

We had an internal report of this issue and it has been fixed in TF nightly.