kuza55 / keras-extras

Extra batteries for Keras

The speed of using multiple GPUs

freshmanfresh opened this issue · comments

I use TensorFlow as the backend and multi_gpu.py to do multi-GPU training. However, I find that training on two GPUs is almost as fast as training on one. Also, with one GPU the GPU utilization is almost 100%, but with two GPUs the utilization of each GPU is only about 40-60%. How can I solve this problem?

My environment:
CPU: 40x Intel E5-2630 v4
Mem: 384GB
GPU: 4x NVIDIA GTX 1080 Ti

Yes, I observe similar behavior. For a single GPU (GTX 1070) the time per epoch converges to 4 s, for 2 GPUs to 6 s, whereas the optimum would be 2 s. With more GPUs the time gets even worse.

When I compared this code to the TensorFlow CIFAR-10 tutorial, I noticed that their code computes the gradients in parallel and then averages them on the CPU, whereas this code computes only the predictions in parallel. When I logged the operation placement, the model parameters were not located on the GPUs, and gpu:0 had many more operations than gpu:1.

Yes, I see the same thing. Does that mean that if we want to really use multiple GPUs, we have to use TensorFlow directly? Is there any way to improve the multi-GPU performance in Keras?

I made an interesting observation and was actually able to make the Keras model parallelize well!

The basic model has to be placed on the cpu:0 device; by default it ends up on gpu:0.

Working example with MNIST MLP:

'''Trains a simple deep NN on the MNIST dataset.

Gets to 98.40% test accuracy after 20 epochs
(there is *a lot* of margin for parameter tuning).
2 seconds per epoch on a K520 GPU.
'''

from __future__ import print_function

import keras
from keras.datasets import mnist
from keras.models import Model
from keras.layers import Dense, Dropout, Input, Lambda
from keras.layers.merge import concatenate
from keras.optimizers import RMSprop
from keras import backend as K
import os
import tensorflow as tf

# You can check the operation placement (though it's a bit verbose).
# sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# K.set_session(sess)

# gpu_count=1, batch_size=128, (width=1024) -> 11s / epoch

def make_parallel(model, gpu_count):
    def get_slice(data, idx, parts):
        # Take the idx-th slice along the batch dimension (all other dimensions
        # are kept whole), so each of the `parts` GPUs gets batch_size / parts samples.
        shape = tf.shape(data)
        size = tf.concat([shape[:1] // parts, shape[1:]], axis=0)
        stride = tf.concat([shape[:1] // parts, shape[1:] * 0], axis=0)
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = []
    for i in range(len(model.outputs)):
        outputs_all.append([])

    #Place a copy of the model on each GPU, each getting a slice of the batch
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:

                inputs = []
                #Slice each input into a piece for processing on this GPU
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice, output_shape=input_shape, arguments={'idx':i,'parts':gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)

                if not isinstance(outputs, list):
                    outputs = [outputs]

                #Save all the outputs for merging back together later
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # merge outputs on CPU
    with tf.device('/cpu:0'):
        merged = []
        for outputs in outputs_all:
            merged.append(concatenate(outputs, axis=0))

        return Model(inputs=model.inputs, outputs=merged)

# Infer the number of GPUs from CUDA_VISIBLE_DEVICES (e.g. "0,1" -> 2).
gpu_count = len([dev for dev in os.environ.get('CUDA_VISIBLE_DEVICES', '').split(',') if len(dev.strip()) > 0])

batch_size = 128 * max(gpu_count, 1)  # scale the total batch size with the number of GPUs
num_classes = 10
epochs = 20

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255

print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

with tf.device('/cpu:0'): # important!!!
    input = Input(shape=(784,))
    x = Dense(1024, activation='relu')(input)
    x = Dropout(0.2)(x)
    x = Dense(1024, activation='relu')(x)
    x = Dropout(0.2)(x)
    output = Dense(10, activation='softmax')(x)

    model = Model(inputs=input, outputs=output)

    print('Single tower model:')
    model.summary()

    if gpu_count > 1:
        model = make_parallel(model, gpu_count)

        print('Multi-GPU model:')
        model.summary()

    model.compile(loss='categorical_crossentropy',
                  optimizer=RMSprop(),
                  metrics=['accuracy'])

    history = model.fit(x_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        verbose=1,
                        validation_data=(x_test, y_test))
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

After a few epochs the time per epoch stabilizes. Results on GTX 1070 GPUs:

  • 1 GPU: 11 s / epoch
  • 2 GPU: 6 s / epoch
  • 3 GPU: fails (due to #7)
  • 4 GPU: 3 s / epoch

Ah, in the case of a single GPU we should leave the model on gpu:0, not cpu:0.

Hm... with this fixed, the basic model on 1 GPU runs at 2 s/epoch (better than with more GPUs). The slow speed in the 1-GPU setting was actually caused by the model running on the CPU.
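E.g. a minimal sketch of that conditional placement (reusing gpu_count and make_parallel from the script above; the layer sizes here are just for illustration):

import tensorflow as tf
from keras.layers import Dense, Input
from keras.models import Model

# Keep a single-GPU model on the GPU; only place the base model on the CPU
# when it will actually be replicated across several GPUs.
device = '/gpu:0' if gpu_count <= 1 else '/cpu:0'
with tf.device(device):
    inp = Input(shape=(784,))
    hidden = Dense(1024, activation='relu')(inp)
    out = Dense(10, activation='softmax')(hidden)
    model = Model(inputs=inp, outputs=out)

if gpu_count > 1:
    model = make_parallel(model, gpu_count)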

I will check it and report the result.
Hm... one GPU (GTX 1080 Ti) at 290 s/epoch is still much faster than two GPUs at 813 s/epoch.

It seems to me that the problem might be caused by the fact that only the predictions are computed in parallel. They are then moved to the parameter server (cpu:0 or gpu:0), and the gradients for the whole batch (gpu_count * batch_size) are computed on a single device! Computing gradients can be expensive. In the TensorFlow CIFAR-10 tutorial they compute the gradients on each device, move them to the PS device, average them there, and update the weights.
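A rough sketch of that tower pattern in plain TF 1.x (not what this repo does; tower_loss is a hypothetical function returning the loss of tower i on its slice of the batch, and gpu_count is assumed to be defined):

import tensorflow as tf

# Each GPU computes the gradients for its slice; the CPU averages them and
# applies a single weight update.
opt = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []
for i in range(gpu_count):
    with tf.device('/gpu:%d' % i):
        loss = tower_loss(i)  # hypothetical per-tower loss
        tower_grads.append(opt.compute_gradients(loss))

with tf.device('/cpu:0'):
    averaged = []
    for grads_and_vars in zip(*tower_grads):  # same variable across all towers
        grads = [g for g, _ in grads_and_vars]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0),
                         grads_and_vars[0][1]))
    train_op = opt.apply_gradients(averaged)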

Another difference is that they explicitly do variable sharing via the scopes. I'd assume Keras does something similar under the hood when applying the base model multiple times, but I'm not sure about that.
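For reference, the explicit sharing in plain TF 1.x looks roughly like this (a sketch; gpu_count is assumed to be defined, and the tiny linear model is just a placeholder):

import tensorflow as tf

def tower_logits(x):
    # get_variable inside a shared variable_scope makes all towers use the same weights
    w = tf.get_variable('w', shape=[784, 10])
    b = tf.get_variable('b', shape=[10])
    return tf.matmul(x, w) + b

x = tf.placeholder(tf.float32, [None, 784])
slices = tf.split(x, gpu_count, axis=0)
logits = []
for i in range(gpu_count):
    # reuse=(i > 0): the first tower creates the variables, later towers reuse them
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        logits.append(tower_logits(slices[i]))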

Another possible pain point might be a bit strange GPU topology on our machine (many cards, PCIe riser). I'll try to run the experiments on some standard cloud VM.

As nicely noted in the Caffe docs, we have two options for setting the batch size:

  • keep total batch_size, each replica gets batch_size / N
  • total batch_size * N, each replica gets batch_size

In the first case we would hope for a lower training time per epoch, but because each replica then runs with a smaller, less efficient batch, it may actually end up slower. So the second option is preferable: give each GPU the batch size that is optimal for a single GPU.

In the second case, after parallelizing to N GPUs we would then expect the time per batch to stay the same, while the time per epoch would ideally drop to 1/N of the single-GPU time.
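In terms of the script above that is just (per_gpu_batch_size is a name I'm introducing here):

# Weak scaling (option 2): keep the per-GPU batch constant and grow the
# total batch with the number of GPUs, as the script above already does.
per_gpu_batch_size = 128
batch_size = per_gpu_batch_size * max(gpu_count, 1)

# Strong scaling (option 1) would instead keep batch_size fixed, so each of
# the gpu_count replicas would get batch_size // gpu_count samples.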

@bzamecnik Thanks for looking into this; I thought this should be resolved since I patched Keras to pass colocate_gradients_with_ops=True to TensorFlow.

I wonder if I missed a place in Keras or if the option isn't working the way I expected it to.

@kuza55 Aah, thanks for mentioning this! I saw your PR regarding colocate_gradients_with_ops but didn't check which release it landed in, so I assumed it had already been published earlier. I'll re-run the experiments on a fixed Keras and see if it helps in this case.

Actually, now I see it was merged on 30 Aug 2016, so it should have been released in Keras 1.1.0.

@bzamecnik Hey, just trying to understand your version of make_parallel(). Why do you concatenate the model outputs into one "merged" output? I sort of expected some kind of averaging operation. Is it because you train each output on its own batch slice of the data, which automatically averages the weight updates since they're all applied to the same model/hidden layers?

Can somebody explain how gradients are calculated for such a model? It seems like parallelism only happens during the forward pass, unless TensorFlow magically parallelizes the backward pass in the same way; and if it does, where is the code that averages the gradients?

@shivamkalra To parallelize the backpropagation of gradients, the colocate_gradients_with_ops flag has to be set to true. This ensures that each gradient op runs on the same device as the corresponding forward op.
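For illustration, this is roughly what the flag means at the tf.gradients level (a toy sketch, not the repo's code):

import tensorflow as tf

# With colocate_gradients_with_ops=True each gradient op is placed on the same
# device as the forward op it differentiates, so the backward pass of each
# tower stays on that tower's GPU.
with tf.device('/gpu:0'):
    w = tf.Variable(tf.ones([784, 10]))
    x = tf.placeholder(tf.float32, [None, 784])
    loss = tf.reduce_mean(tf.matmul(x, w))

grads = tf.gradients(loss, [w], colocate_gradients_with_ops=True)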

@normanheckscher Thanks. Just to make sure I understand: the gradients for each slice are computed on the multiple GPUs simultaneously during the backward pass, and the final parameter update happens on the CPU? And is that final update the average of the gradients from all the slices?

Hi @kuza55, where did you pass colocate_gradients_with_ops=True to TensorFlow? I didn't find it anywhere.
Update: I found it in tensorflow_backend.py, and it is already set to True. However, even when I use with tf.device('/cpu:0'): as @bzamecnik suggested, GPU 0 has much higher utilization than the other GPUs.

@kuza55 Thanks. When I use with tf.device('/cpu:0'): as @bzamecnik suggested, GPU 0 still gets much higher utilization than the other GPUs. Is there a particular reason for that?

@shivamkalra What happens under the hood with the gradients is a bit opaque, but there really is an implicit gradient averaging. In the Keras code we just place the computation of each slice's outputs on its GPU, then merge them on the CPU and compute the loss there. If I understand it correctly, this is what happens: the loss is composed of the losses on each slice, and the gradient of the loss for a batch is the average of the gradients for each sample, i.e. also the average over the slices. Thanks to colocate_gradients_with_ops=True, the gradients of the loss with respect to the weights are computed on each particular GPU. They are then moved to the CPU and averaged. I'm not sure whether they're averaged in two stages (first within the slice on each GPU, then across slices on the CPU) or in one.
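To spell out the averaging claim (a small derivation, assuming the loss is a mean over the batch, as Keras losses are, and that the slices have equal size): for a batch of size $B$ split into $N$ slices $S_1, \dots, S_N$ of size $B/N$,

$$L = \frac{1}{B}\sum_{i=1}^{B}\ell_i = \frac{1}{N}\sum_{k=1}^{N}\Big(\frac{N}{B}\sum_{i\in S_k}\ell_i\Big) = \frac{1}{N}\sum_{k=1}^{N}L_k, \qquad \nabla_w L = \frac{1}{N}\sum_{k=1}^{N}\nabla_w L_k,$$

so averaging the per-slice gradients gives exactly the gradient of the mean loss over the whole batch.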

As for exchanging gradients and weights between the PS and the GPUs, it's very interesting to observe what TF does with implicit copies. I thought there would be one big transfer there and one back. No. Not all weights/gradients are needed for computing each layer, so TF only implicitly copies what's necessary, and it can also make some implicit copies in parallel with preceding computations. This means the time cost of exchanging weights/gradients is lower than one might expect.

The NVIDIA profiler (nvprof) and the Visual Profiler are really good tools for exploring what is actually happening there.

@jiang1st I'm not sure. When I examined the runs with nvprof I usually saw a fairly balanced load. If there was an imbalance (e.g. in computation speed), it was typically caused by different memory clock settings on two otherwise identical GPUs. Try comparing the placement of operations across the GPUs with device placement logging, TensorBoard, or nvprof.

commented

Thanks @bzamecnik. Will try as you suggested.

I also ran into this problem yesterday. When I increased the batch_size, multi-GPU became faster than a single GPU. Maybe it is because increasing the batch_size increases the GPU computation per step, while the communication cost between the CPU and the GPUs stays the same.

Hi,
I tried the code proposed by @bzamecnik with batch_size=64 and GPUs = 2.
I understand that since I have batch_size=64 and 2 GPUs, each GPU will run with a batch size of 32, right? But why is the model running so slowly?
Thanks