arrayfire / arrayfire-python

Python bindings for ArrayFire: A general purpose GPU library.

Home Page: https://arrayfire.com


Documentation: Multiple GPUs

georgh opened this issue · comments

I think it would be great to have an example for using multiple GPUs.

Here is what I tried. If that's the right way to do it, you may add it as an example.
It seems to scale fine (tested with up to 7 GPUs), and nvidia-smi reports 96% utilization.

import time
import numpy as np
import arrayfire as af
import argparse

af.set_backend('cuda')

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('gpus', type=int)
    parser.add_argument('-runs', type=int, default=100)
    args = parser.parse_args()

    GPUS = args.gpus
    N = 5000
    runs = args.runs

    # The simple task we want to solve:
    # we have a large list of vectors X and want the pairwise distances
    # between all of them, which yields a large distance matrix M.
    # The resulting matrix is then multiplied by a vector alpha.
    X = np.random.rand(100, N)
    Alpha = np.random.rand(N, 1)

    # Copy the data to each device once:
    xGPU = []
    alphaGPU = []
    for i in range(GPUS):
        af.set_device(i)
        xGPU.append(af.to_array(X))
        alphaGPU.append(af.to_array(Alpha))

    sub = lambda a, b: a - b
    print("init finished")
    for _ in range(runs):
        startTime = time.time()
        splitSize = int(np.ceil(N / GPUS))
        # print("Temp data will occupy at least {:.2f} MB on the gpu."
        #       .format(X.shape[0] * splitSize * X.shape[1] * 8 / 1024 / 1024))

        result = []
        for i in range(GPUS):
            af.set_device(i)
            x = xGPU[i]
            alpha = alphaGPU[i]

            # Each device handles one contiguous slice of the columns.
            start = i * splitSize
            end = min((i + 1) * splitSize, N)

            # Pairwise differences via broadcasting, then Euclidean norms.
            diff = af.broadcast(sub,
                                af.tile(x[:, start:end], 1, 1, x.shape[1]),
                                af.moddims(x, x.shape[0], 1, x.shape[1]))
            dist = af.sqrt(af.sum(af.pow(diff, 2), 0))
            r = af.matmul(af.moddims(dist, dist.shape[1], dist.shape[2]), alpha)
            result.append(r)

        total = 0
        for i in range(GPUS):
            af.set_device(i)
            # af.sum over the whole array returns a host scalar,
            # which synchronizes device i.
            total += af.sum(result[i])

        print("Took {} sec".format(time.time() - startTime))
 

The main remaining question is how to make use of the CPU during the GPU computation.
Do you split the work off with multiprocessing, or is there an easier way?

@georgh can you send this as a PR?

The main remaining question is how to make use of the CPU during the GPU computation.

The GPU ops are asynchronous. You can do other work on the CPU as long as you don't run any synchronizing functions (af.sync, or any function that copies memory back to the host).