arrayfire / arrayfire-python

Python bindings for ArrayFire: A general purpose GPU library.

Home Page: https://arrayfire.com

Need explanation on when to use af.eval() and af.sync()

itsnarsi opened this issue · comments

I am using a function which uses af.broadcast inside it.

When I run the program on my Windows PC, I do not see any slowdown of the processing on the CPU. However, on my Linux PC, which has both a CPU and an AMD RX 480 GPU, the code slows down drastically.

In issue #123, @pavanky mentioned using af.eval and af.sync when using the broadcast function.
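To illustrate, here is a minimal sketch of the kind of broadcast pattern I mean (hypothetical shapes, not my actual code):

```python
import arrayfire as af

# af.broadcast lets an element-wise function mix arrays whose
# dimensions differ, e.g. a matrix and a row of per-column means.
@af.broadcast
def center(x, mu):
    return x - mu

a = af.randu(5, 5)       # 5x5 matrix
mu = af.mean(a, dim=0)   # 1x5 row of column means
out = center(a, mu)      # mu is broadcast across the rows of a
```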

Can anyone give me a simple explanation of this, if possible?

Thanks

EDIT (June 30, 2017): The problem was that I had not updated ArrayFire to the newest version on Linux.

@itsnarsi you shouldn't need either in real code. You only need them when you are benchmarking your code.
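For example, a typical benchmarking pattern looks something like this (a rough sketch with made-up sizes, not code from this project):

```python
import time
import arrayfire as af

a = af.randu(2048, 2048)
b = af.randu(2048, 2048)

start = time.time()
c = a * b + a   # element-wise ops are only recorded by the JIT at this point
af.eval(c)      # force the fused kernel for c to be queued
af.sync()       # wait for the device to finish before reading the clock
print("elapsed: %.4f s" % (time.time() - start))
```

Without the eval/sync pair, the timer mostly measures how fast the JIT can record operations, not how fast the device executes them.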

Care to share some code showing where you are seeing this?

OK, I need to back up and give a proper explanation of what I am doing:
I am trying to build automatic differentiation on top of ArrayFire:
LearnAF

  • I was able to create the outline successfully and execute it properly.

  • Then I went on to create an MLP using this auto-differentiation. It started working successfully on my Windows PC (CPU only, 6th-gen i7).

[Screenshot: Jupyter notebook (layers.ipynb)]

  • I have another PC with a 6th-gen i7 and an RX 480. When I test on both the CPU and the GPU ('opencl'), the speed drops drastically as the batches go by.

Problem spots: matmul, broadcast, and unbroadcast.

PS: Sorry for the bad syntax and lack of comments; I am still in the very early stages of the code.

Hmm, you definitely do not need af.eval or af.sync in your case.

Can you help me understand exactly how they work?

@itsnarsi I sent you an explanation of the two functions on Gitter.

ArrayFire creates kernels on the fly using a JIT, so many element-wise operations are not executed until the result is required by the user or by non-JITed functions. eval tells the JIT to evaluate the kernels associated with the array. This is not a synchronous operation, so it returns right away. sync, on the other hand, blocks the calling thread until all queued operations are complete. This does not include unevaluated JIT kernels.
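In other words (a sketch with made-up sizes):

```python
import arrayfire as af

a = af.randu(1000, 1000)
b = af.randu(1000, 1000)

c = a * b + af.sin(a)  # nothing executes yet: the JIT just records these ops

af.eval(c)  # queue the fused kernel that computes c; returns immediately
af.sync()   # block this host thread until all queued device work is done
```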

@umar456 thanks for the info.

I'll try to implement these in my code and let you guys know.

Thanks

@umar456 @pavanky OK, wow. I modified the code to use af.eval() and af.sync(), and now the performance on my PC has literally doubled! (Close to Keras + TensorFlow on CPU.)
[Screenshot: Windows results]

While I see an improvement on the Linux PC, it is not what I hoped for.

[Screenshot: before (Then)]

[Screenshot: after (New)]

Is there any reason for this to happen?
When it comes to 'opencl', the code is still lagging behind a lot.

@itsnarsi I'll give it a try once I get home. Ping me if I don't respond in a couple of days.

@itsnarsi can you point me to the script/file you are running, as well as the output of af.info() from both machines?

The notebook I'm running: Here

My Windows PC:

>>> af.info()
ArrayFire v3.4.2 (CPU, 64-bit Windows, build 2da9967)
[0] AMD     : Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, 32669 MB

Linux:

>>> af.info()
ArrayFire v3.4.1 (OpenCL, 64-bit Linux, build b9055b1)
[0] AMD     : Ellesmere, 7731 MB
-1- AMD     : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, 15996 MB

@itsnarsi can you install 3.4.2 on your Linux machine as well? I had a couple of JIT-related fixes in 3.4.2 that may explain some of this.

@pavanky as soon as I posted it, I realized that I had not updated ArrayFire on my Linux machine.

Sorry, I will do that and send you the results as soon as possible.

@pavanky OK, thanks for reminding me to update.
The CPU backend on both Linux and Windows is working as needed.
Same as Windows on Linux now!
[Screenshot: CPU results]

In the case of 'opencl' on my RX 480:
[Screenshot: OpenCL results]

  • The first epoch gave a really low iteration rate, but after that I got around 90 batches per second.

If I am reading this correctly, you are getting 96 batches/s on the GPU but 325 batches/s on the CPU?

@itsnarsi are you using floats or doubles for your experiment?

@pavanky I am converting a float32 NumPy array to an af.Array using af.np_to_af_array in all cases.
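i.e., roughly this (hypothetical shape):

```python
import numpy as np
import arrayfire as af

x_np = np.random.rand(64, 784).astype(np.float32)  # hypothetical batch
x_af = af.np_to_af_array(x_np)  # copies the host buffer to the device
```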

@itsnarsi I am also unsure when the data copy occurs. If you are constantly copying data to the GPU, the timings may be skewed because of that.
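For example, copying the whole dataset once and slicing batches on the device avoids paying a host-to-device transfer per batch (a sketch with made-up shapes, not code from the notebook):

```python
import numpy as np
import arrayfire as af

data_np = np.random.rand(60000, 784).astype(np.float32)
bs = 64

# One host-to-device copy for the whole dataset...
data_af = af.np_to_af_array(data_np)

for i in range(data_af.dims()[0] // bs):
    # ...then device-side slices per batch, with no extra host copies.
    batch = data_af[i * bs:(i + 1) * bs, :]
    # forward/backward pass on `batch` goes here
```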

@itsnarsi I sent you a personal message on Gitter; can you reply to me there? :)