tensorflow / models

Models and examples built with TensorFlow

Slim multi-gpu performance problems

Ettard opened this issue · comments

I was using the Slim models with the flowers dataset on Ubuntu 16.04.

TensorFlow version: 1.1.0rc2, built from source

git version:
34c738cc6d3badcb22e3f72482536ada29bd0e65

Bazel version:
Build label: 0.4.5
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Thu Mar 16 12:19:38 2017 (1489666778)
Build timestamp: 1489666778
Build timestamp as int: 1489666778

CUDA version: 8.0.44
cuDNN version: 5.1.5
GPUs: 3x GeForce GTX 1080 Ti, 11GB each
Memory: 32GB

I didn't change the source code.
With 1 GPU:

TRAIN_DIR=/tmp/train_logs
DATASET_DIR=/home/l/data/flowers
python train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_name=flowers --dataset_split_name=train --dataset_dir=${DATASET_DIR} --model_name=inception_resnet_v2

(the log here is the same as when running with 3 GPUs)

INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 3.2313 (0.96 sec/step)
INFO:tensorflow:global step 20: loss = 3.7792 (0.97 sec/step)
INFO:tensorflow:global step 30: loss = 2.9681 (0.96 sec/step)
INFO:tensorflow:global step 40: loss = 3.8321 (0.97 sec/step)
INFO:tensorflow:global step 50: loss = 3.2210 (0.96 sec/step)
...

When I use 3 GPUs:
python train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_name=flowers --dataset_split_name=train --dataset_dir=${DATASET_DIR} --model_name=inception_resnet_v2 --num_clones=3
2017-04-24 14:26:11.885411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:05:00.0
Total memory: 10.91GiB
Free memory: 10.53GiB
2017-04-24 14:26:11.885472: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x5b62c2c0
2017-04-24 14:26:12.131777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 1 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:06:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-04-24 14:26:12.131848: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x5945f2d0
2017-04-24 14:26:12.369331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 2 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:09:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-04-24 14:26:12.371583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2
2017-04-24 14:26:12.371596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y Y Y
2017-04-24 14:26:12.371601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1: Y Y Y
2017-04-24 14:26:12.371606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2: Y Y Y
2017-04-24 14:26:12.371615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:05:00.0)
2017-04-24 14:26:12.371622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Graphics Device, pci bus id: 0000:06:00.0)
2017-04-24 14:26:12.371625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Graphics Device, pci bus id: 0000:09:00.0)
INFO:tensorflow:Restoring parameters from /tmp/train_logs/model.ckpt-0
2017-04-24 14:26:17.426353: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:2 for node 'clone_2/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0
2017-04-24 14:26:17.427748: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:1 for node 'clone_1/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0
2017-04-24 14:26:17.429099: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:0 for node 'clone_0/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /tmp/train_logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 2.9670 (0.98 sec/step)
INFO:tensorflow:global step 20: loss = 2.9945 (0.99 sec/step)
INFO:tensorflow:global step 30: loss = 3.0432 (0.99 sec/step)
INFO:tensorflow:global step 40: loss = 3.0007 (1.04 sec/step)
INFO:tensorflow:global step 50: loss = 2.8072 (1.03 sec/step)
...

I saw "Ignoring device specification" messages, and the training speed didn't change.
This is the nvidia-smi output with 3 GPUs.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.13 Driver Version: 378.13 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Graphics Device Off | 0000:05:00.0 On | N/A |
| 49% 83C P2 140W / 250W | 10754MiB / 11171MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Graphics Device Off | 0000:06:00.0 Off | N/A |
| 47% 81C P2 137W / 250W | 10744MiB / 11172MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 2 Graphics Device Off | 0000:09:00.0 Off | N/A |
| 43% 74C P2 130W / 250W | 10744MiB / 11172MiB | 98% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1065 G /usr/lib/xorg/Xorg 160MiB |
| 0 1757 G compiz 81MiB |
| 0 14407 C python 10497MiB |
| 1 14407 C python 10729MiB |
| 2 14407 C python 10729MiB |
+-----------------------------------------------------------------------------+

Something else:
I tried the Inception model with 3 GPUs and it worked well, with a speed boost. There was no "Ignoring device specification" in the Inception model logs. I'm not sure whether that is related to the problem.

Similar problems:
#1338
tensorflow/tensorflow#8061 (I tried the script on TF 1.1.0 and "Ignoring device specification" appeared there too. If someone needs details, I will post logs.)

I changed the model to inception_v3. Nothing seems to have changed.
I'm also considering outputting the batch contents, in case that would be helpful.

Can you summarize what worked and what didn't work? It is hard for me to follow so much text without any formatting.

With the Slim model, when I use 3 GPUs, they all run at 98% or 99% utilization, but the training speed stays the same. The problem does not seem to appear when I use the Inception model with 3 GPUs. The only difference I found in the logs is the device specification message, and I'm not sure whether that's the cause. @girving

@sguada Any ideas?

@girving @sguada @aselle

Following up on this. I've spent the past month or so hacking additional functionality into an Inception v3 network on top of Slim (e.g. outputting probability distributions as desired here, and as demonstrated elsewhere well enough to extrapolate).

I need to get an Inception v3/v4 network running across 4 GPUs over the next few days on a new data set, and I'd like to help move us forward toward a productive resolution here that also makes sense for resolving other issues...

To start with, @Ettard seems to suggest that multi-GPU is currently working with the Inception repo but not the Slim repo? If so, that would be ironic, given that @nealwu committed this warning to the Inception README.md:

NOTE: For the most part, you will find a newer version of this code at models/slim. In particular:
inception_train.py and imagenet_train.py should no longer be used. The slim editions for running on multiple GPUs are the current best examples.

As a parallel issue, after 50+ hours invested in Slim, I can say I like its spirit, but training on a new dataset seems relatively tedious, at least for me. This may be my oversight, but Slim doesn't seem to have a script as useful as build_image_data.py in Inception, which is designed for quickly pre-processing new datasets. In any case, it's possible I'm not using dataset_factory.py with as much automation as was intended, so please forgive me if I'm missing something!

Given these issues, should we "abandon" Inception, as the above note from @nealwu may suggest? If not, perhaps we should revise that note and consolidate efforts here more effectively, to reflect our intent for an efficient relationship between Inception and Slim?

In any case, I'm excited to train an Inception network on 4 GPUs these next few days, and am willing to contribute fixes / report back findings as may be helpful, with your high-level support...

Adding @tfboyd. Also see the discussion in #1009.

I noticed the discussion in #1428. It reminded me that when training with multiple GPUs, the training speed is measured in steps rather than images per second. I wonder if it's possible to show the number of images per step, or maybe I've just misunderstood.

Images per step is also known as batch size. In train_image_classifier.py it is 32 by default.
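If you want a different value, you can pass it on the command line; assuming the flag is named --batch_size, as its definition in train_image_classifier.py suggests, for example:

python train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_dir=${DATASET_DIR} --dataset_name=flowers --dataset_split_name=train --model_name=inception_resnet_v2 --num_clones=3 --batch_size=64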

So with 3 GPUs it trains 96 images per step?
The other question is: why does it show "Ignoring device specification /device:GPU:X for node 'clone_X/fifo_queue_Dequeue'"? Does it affect performance?

Hi @Ettard @legel, I also noticed a similar issue. Using multiple GPUs does not make the training faster.
I was training inception-v3 and the settings were as follows:
1 GPU: batch_size=32, steps=20000
2 GPUs: batch_size=64, steps=10000
4 GPUs: batch_size=128, steps=5000

The results are:
1 GPU: 0.6 sec/step which is 61.5 images/sec
2 GPUs: 0.99 sec/step which is 64.6 images/sec
4 GPUs: 2 sec/step which is 64.5 images/sec

So there is no improvement when using multiple GPUs. Not sure why.

I have the same problem.

In train_image_classifier.py,
batch_queue = slim.prefetch_queue.prefetch_queue(...) is first assigned to /device:CPU:0 inside the deploy_config.inputs_device() block.
However, in clone_fn, batch_queue.dequeue() is assigned to /device:GPU:X.

This causes a conflict when assigning devices, so TensorFlow ignores the device specification and prints the "Ignoring device specification /device:GPU:X for node 'clone_X/fifo_queue_Dequeue'" message.
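To make the placement conflict concrete, here is a minimal standalone sketch. It uses a plain tf.FIFOQueue instead of the actual prefetch_queue code, so the shapes and capacity are only illustrative:

    import tensorflow as tf  # TF 1.x graph-mode API, as used in this thread

    # The queue and its internal state are pinned to the CPU, like the
    # prefetch_queue in train_image_classifier.py.
    with tf.device('/device:CPU:0'):
        queue = tf.FIFOQueue(capacity=8, dtypes=[tf.float32],
                             shapes=[[32, 299, 299, 3]])

    # Each clone requests its dequeue op on its own GPU ...
    dequeue_ops = []
    for i in range(3):
        with tf.device('/device:GPU:%d' % i):
            dequeue_ops.append(queue.dequeue())

    # ... but the dequeue op is a reference connection to the CPU-resident
    # queue, so the placer drops the GPU specification and logs
    # "Ignoring device specification /device:GPU:X for node ...".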

I don't think there is a problem with the multi-GPU setup, but there are some metrics that create confusion. As this comment states, you should not focus on the sec/step value alone; you also have to take into account the number of GPUs (number of clones) you are using, which determines the total number of images processed per step.

Consider that model_deploy.py is an implementation of local synchronous training. It builds a copy of the model on each GPU and feeds a different batch to each one. When the gradients of all cloned models have been computed, they are summed (_sum_clones_gradients).
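For reference, the summing step looks roughly like this. It is a simplified sketch, assuming each clone yields a list of (gradient, variable) pairs; the real _sum_clones_gradients in model_deploy.py has more bookkeeping:

    import tensorflow as tf

    def sum_clone_gradients(clone_grads):
        # clone_grads: one list of (grad, var) pairs per clone,
        # all listing the variables in the same order.
        summed = []
        for grad_and_vars in zip(*clone_grads):
            grads = [g for g, _ in grad_and_vars if g is not None]
            var = grad_and_vars[0][1]
            if grads:
                # Sum the per-clone gradients for this variable.
                total = grads[0] if len(grads) == 1 else tf.add_n(grads)
                summed.append((total, var))
        return summed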

That's why the sec/step increases slightly when you add more GPUs: the CPU now has to feed data to more GPUs and has to aggregate the gradients of every clone before updating the model. However, if you look at the number of images processed per step, taking into account @Ettard's numbers and the default batch size:
1 GPU -> 32 img/step (per GPU) -> 0.97 sec/step -> 1 * 32 / 0.97 = 33 img/sec
3 GPUs -> 32 img/step (per GPU) -> 1.04 sec/step -> 3 * 32 / 1.04 = 92 img/sec
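That works out to a near-linear speedup from 3 GPUs:

    (3 * 32 / 1.04) / (1 * 32 / 0.97) ≈ 92 / 33 ≈ 2.8x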

With synchronous training, your effective batch size per iteration is bigger, which speeds up the learning process (the loss values for 1 GPU are higher than for 3 GPUs) and makes it converge faster.

If you wanted to decrease the sec/step with multi-GPU, you could implement some kind of local asynchronous training (a parameter server on the CPU and a worker on each GPU), but it would probably hurt the learning process, and I don't think it would give a better img/sec, since you are at near-linear speedup with the numbers shown above.

Hi @myagues, in @Ettard's numbers no batch_size is defined, and the default batch size is 32, so he was using batch size 32 for both 1 GPU and 3 GPUs, and one batch of images is divided among all the GPUs. It does not mean each GPU processes one full batch; it means all the GPUs process different images of one batch at the same time. For my part, I don't actually focus on images/sec; I focus on the total wall-clock training time.

In my experiment:
1 GPU: batch_size=32, steps=20000
2 GPUs: batch_size=64, steps=10000
4 GPUs: batch_size=128, steps=5000

The results are:
1 GPU: 0.6 sec/step which is 61.5 images/sec, training time=89m59.320s
2 GPUs: 0.99 sec/step which is 64.6 images/sec, training time=59m58.892s
4 GPUs: 2 sec/step which is 64.5 images/sec, training time=173m46.263s

I don't know why the training time varies a lot while the images/sec numbers are almost the same. And the performance with 4 GPUs is much slower. Not sure why.

@Ettard have you figured out your problem?

Marking this closed. Slim has one mode for multi-GPU, which is putting all of the parameters on GPU:0. This is problematic, more so with faster GPUs, and even more so if GPUDirect peer-to-peer is unavailable; even with it, on P100 NVLink systems not all GPUs are directly connected. The updated Estimator offers viable solutions, and if someone wants to retrofit Slim, even just a five-line change to put the parameter server on the CPU would offer a boost as well.
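For anyone who wants to try that retrofit, the general pattern is to pin variable creation to the CPU while keeping compute on the GPUs. A rough, hypothetical sketch (this is not the actual Slim code or a Slim flag, just an illustration of the idea; model_fn stands in for whatever network you are building):

    import tensorflow as tf  # TF 1.x, matching the rest of this thread

    def cpu_variable_getter(getter, *args, **kwargs):
        # Force every variable created under this scope onto the CPU, so the
        # GPUs share a single CPU-resident copy (a local "parameter server
        # on CPU") instead of all reading from GPU:0.
        with tf.device('/device:CPU:0'):
            return getter(*args, **kwargs)

    def build_tower(images, labels, gpu_id, model_fn):
        # Compute ops go to this tower's GPU; variables are redirected to
        # the CPU by the custom getter above.
        with tf.device('/device:GPU:%d' % gpu_id):
            with tf.variable_scope('model', reuse=(gpu_id > 0),
                                   custom_getter=cpu_variable_getter):
                logits = model_fn(images)
                loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
        return loss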