arrayfire / arrayfire-python

Python bindings for ArrayFire: A general purpose GPU library.

Home Page: https://arrayfire.com


Documentation: Multiple GPUs

georgh opened this issue · comments

I think it would be great to have an example for using multiple GPUs.

Here is what I tried. If that's the right way to do it, you may add it as an example.
It seems to scale fine (tested with up to 7 GPUs), and nvidia-smi reports 96% utilization.

import time
import numpy as np
import arrayfire as af
import argparse

af.set_backend('cuda')

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('gpus', type=int)
    parser.add_argument('-runs', type=int, default=100)
    args = parser.parse_args()

    GPUS = args.gpus
    N = 5000
    runs = args.runs

    # The simple task we want to solve:
    # we have a large list of vectors X and want the pairwise distances
    # between all of them, which yields a large distance matrix M.
    # The resulting matrix is then multiplied by a vector alpha.
    X = np.random.rand(100, N)
    Alpha = np.random.rand(N, 1)

    # Copy the data to each device once:
    xGPU = []
    alphaGPU = []
    for i in range(GPUS):
        af.set_device(i)
        xGPU.append(af.to_array(X))
        alphaGPU.append(af.to_array(Alpha))

    sub = lambda a, b: a - b
    print("init finished")
    for _ in range(runs):
        startTime = time.time()
        splitSize = int(np.ceil(N / GPUS))
        # print("Temp data will occupy at least {:.2f} MB on the gpu."
        #       .format(X.shape[0] * splitSize * X.shape[1] * 8 / 1024 / 1024))

        result = []
        for i in range(GPUS):
            af.set_device(i)
            x = xGPU[i]
            alpha = alphaGPU[i]

            # Each device handles one contiguous slice of the columns.
            start = i * splitSize
            end = min((i + 1) * splitSize, N)

            # Pairwise differences via broadcasting, then Euclidean norms.
            diff = af.broadcast(sub,
                                af.tile(x[:, start:end], 1, 1, x.shape[1]),
                                af.moddims(x, x.shape[0], 1, x.shape[1]))
            dist = af.sqrt(af.sum(af.pow(diff, 2), 0))
            r = af.matmul(af.moddims(dist, dist.shape[1], dist.shape[2]), alpha)
            result.append(r)

        total = 0
        for i in range(GPUS):
            af.set_device(i)
            # af.sum over the whole array returns a host scalar,
            # which synchronizes device i.
            total += af.sum(result[i])

        print("Took {} sec".format(time.time() - startTime))
 

The main remaining question is how to make use of the CPU during the GPU computation.
Do you split the work off with multiprocessing, or is there an easier way?

@georgh can you send this as a PR?

The main remaining question is how to make use of the CPU during the GPU computation.

The GPU ops are asynchronous. You can do other work on the CPU as long as you don't run any synchronizing functions (af.sync, or any function that copies memory back to the host).