hunkim / DeepLearningZeroToAll

TensorFlow Basic Tutorial Labs

Home Page: https://www.youtube.com/user/hunkims

TF examples take over all GPU memory

sxjscience opened this issue

I find that when we run the TF examples, they try to take over all of the available GPU memory (see https://www.tensorflow.org/tutorials/using_gpu). This can cause trouble on shared servers where many users are using the same GPUs.

For example, when running klab-12-5-seq2seq.py, the GPU usage looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
| 17%   47C    P2    44W / 200W |   7786MiB /  8113MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
|  0%   41C    P2    43W / 200W |   7715MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 0000:82:00.0     Off |                  N/A |
|  0%   44C    P2    43W / 200W |   7715MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I used the method suggested in https://www.tensorflow.org/tutorials/using_gpu, and the memory is now allocated incrementally.

(Adding the following lines after the import at https://github.com/hunkim/DeepLearningZeroToAll/blob/master/Keras/klab-12-5-seq2seq.py#L12 solves the problem.)

from keras.utils.vis_utils import plot_model  # existing import at the linked line; the lines below are the new ones
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf  # needed for ConfigProto below

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
set_session(tf.Session(config=config))

The GPU memory usage becomes:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   47C    P2    44W / 200W |    294MiB /  8113MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
|  0%   41C    P8    14W / 200W |    115MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 0000:82:00.0     Off |                  N/A |
|  0%   44C    P8    14W / 200W |    115MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Is there a way to enable this by default?
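
As an aside, the same guide also describes setting a hard per-process memory cap instead of growing on demand; a rough sketch (0.3 is just an example fraction, not a recommendation):

from keras.backend.tensorflow_backend import set_session
import tensorflow as tf

# Cap this process at roughly 30% of each visible GPU's memory (example value).
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))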

Interesting.

Do you mean that from keras.utils.vis_utils import plot_model helps reduce memory usage?

How about just running TF without Keras? Is there the same memory issue?

This is normal behavior in TensorFlow.

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.

Anyway, I don't see a problem, because:

  1. For people coming to this repo, a situation like this is very rare.
  2. If there isn't enough memory, it will show pool allocation errors. No harm is done.
  3. This is only a problem when you run TensorFlow first and then try to do other things with the GPU later, because there will be no memory left! Conversely, if the memory is already occupied, TensorFlow won't run when there is not enough left (it will still run if there is enough). In either case, no harm is done.

@hunkim Configuring TF's memory allocation strategy with allow_growth=True solves the problem.
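
The same idea works in plain TF without Keras; a minimal sketch, passing the config directly to tf.Session:

import tensorflow as tf

# Same idea as the Keras snippet above: grow GPU memory on demand
# instead of mapping almost all of it at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant("session created with allow_growth")))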

@kkweon Currently I'm hitting the third case. We will need to run multiple jobs on the same GPU or on different GPUs, and the default memory allocation behavior will be problematic in such a scenario.

However, I feel it's reasonable to ignore this for now, since most example scripts take less than 1 minute to complete. In fact, I hadn't noticed the problem until I began running the seq2seq example in Keras, which takes longer.
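
For the multiple-jobs case, one workaround (a sketch, assuming each job can be pinned to its own card) is to restrict which GPUs each process can see via CUDA_VISIBLE_DEVICES, on top of allow_growth:

import os

# Pin this process to GPU 1 only, so concurrent jobs land on different cards
# (CUDA_VISIBLE_DEVICES should be set before TensorFlow initializes its GPU devices).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf

# Combined with allow_growth, this process only grows memory on the single GPU it can see.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)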