arrayfire / arrayfire-python

Python bindings for ArrayFire: A general purpose GPU library.

Home Page: https://arrayfire.com

Need explanation on when to use af.eval() and af.sync()

itsnarsi opened this issue · comments

I am using a function which uses af.broadcast inside it.

When I run the program on my Windows PC, I do not see any slowdown of the processing on the CPU. However, on my Linux PC, which has both a CPU and an AMD RX 480 GPU, the code slows down drastically.

In issue #123, @pavanky mentioned using af.eval and af.sync when using the broadcast function.
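To illustrate, here is a minimal sketch of the kind of broadcast pattern I mean (hypothetical shapes, not my actual code):

```python
import arrayfire as af

# af.broadcast lets an element-wise function mix arrays whose
# dimensions differ, e.g. a matrix and a row of per-column means.
@af.broadcast
def center(x, mu):
    return x - mu

a = af.randu(5, 5)       # 5x5 matrix
mu = af.mean(a, dim=0)   # 1x5 row of column means
out = center(a, mu)      # mu is broadcast across the rows of a
```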

Can anyone give me a simple explanation of this, if possible?

Thanks

EDIT (June 30, 2017): The problem was that I had not updated ArrayFire to the newest version on Linux.

@itsnarsi you shouldn't need either in real code. You only need them when you are benchmarking your code.
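For example, a typical benchmarking pattern looks something like this (a rough sketch with made-up sizes, not code from this project):

```python
import time
import arrayfire as af

a = af.randu(2048, 2048)
b = af.randu(2048, 2048)

start = time.time()
c = a * b + a   # element-wise ops are only recorded by the JIT at this point
af.eval(c)      # force the fused kernel for c to be queued
af.sync()       # wait for the device to finish before reading the clock
print("elapsed: %.4f s" % (time.time() - start))
```

Without the eval/sync pair, the timer mostly measures how fast the JIT can record operations, not how fast the device executes them.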

Care to share some code showing where you are seeing this?

OK, I need to back up and give a proper explanation of what I am doing:
I am trying to build automatic differentiation on top of ArrayFire:
LearnAF

  • I was able to create the outline successfully and execute it properly.

  • Then I went on to create an MLP using this auto-differentiation. It started working successfully on my Windows PC (CPU only, 6th-gen i7).

[Screenshot: Jupyter notebook (layers.ipynb)]

  • I have another PC with a 6th-gen i7 and an RX 480. When I test on both the CPU and the GPU ('opencl'), the speed drops drastically as the batches go by.

Problem spots: matmul, broadcast, and unbroadcast.

PS: Sorry for the bad syntax and lack of comments; I am still in the very early stages of the code.

Hmm, you definitely do not need af.eval or af.sync in your case.

Can you help me understand exactly how they work?

@itsnarsi I sent you an explanation of the two functions on Gitter.

ArrayFire creates kernels on the fly using a JIT, so many element-wise operations are not executed until the result is required by the user or by non-JITed functions. eval tells the JIT to evaluate the kernels associated with the array. This is not a synchronous operation, so it returns right away. sync, on the other hand, blocks the calling thread until all queued operations are complete. This does not include unevaluated JIT kernels.
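In other words (a sketch with made-up sizes):

```python
import arrayfire as af

a = af.randu(1000, 1000)
b = af.randu(1000, 1000)

c = a * b + af.sin(a)  # nothing executes yet: the JIT just records these ops

af.eval(c)  # queue the fused kernel that computes c; returns immediately
af.sync()   # block this host thread until all queued device work is done
```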

@umar456 thanks for the info.

I'll try to implement these in my code and let you guys know.

Thanks

@umar456 @pavanky OK, wow. I modified the code to use af.eval() and af.sync(), and now the performance on my PC has literally doubled! (Close to Keras + TensorFlow on CPU.)
[Screenshot: Windows results]

While I see an improvement on the Linux PC, it is not what I hoped for.

[Screenshot: before (Then)]

[Screenshot: after (New)]

Is there any reason for this to happen?
When it comes to 'opencl', the code is still lagging behind a lot.

@itsnarsi I'll give it a try once I get home. Ping me if I don't respond in a couple of days.

@itsnarsi can you point me to the script/file you are running, as well as the output of af.info() from both machines?

The notebook I'm running: Here

My Windows PC:

>>> af.info()
ArrayFire v3.4.2 (CPU, 64-bit Windows, build 2da9967)
[0] AMD     : Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, 32669 MB

Linux:

>>> af.info()
ArrayFire v3.4.1 (OpenCL, 64-bit Linux, build b9055b1)
[0] AMD     : Ellesmere, 7731 MB
-1- AMD     : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, 15996 MB

@itsnarsi can you install 3.4.2 on your Linux machine as well? I had a couple of JIT-related fixes in 3.4.2 that may explain some of this.

@pavanky as soon as I posted it, I realized that I had not updated ArrayFire on my Linux machine.

Sorry, I will do that and send you the results as soon as possible.

@pavanky OK, thanks for reminding me to update.
The CPU backend on both Linux and Windows is working as needed.
Same as Windows on Linux now!
[Screenshot: CPU results]

In the case of 'opencl' on my RX 480:
[Screenshot: OpenCL results]

  • The first epoch gave a really low iteration rate, but after that I got around 90 batches per second.

If I am reading this correctly, you are getting 96 batches/s on the GPU but 325 batches/s on the CPU?

@itsnarsi are you using floats or doubles for your experiment?

@pavanky I am converting a float32 NumPy array to an af.Array using af.np_to_af_array in all cases.
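i.e., roughly this (hypothetical shape):

```python
import numpy as np
import arrayfire as af

x_np = np.random.rand(64, 784).astype(np.float32)  # hypothetical batch
x_af = af.np_to_af_array(x_np)  # copies the host buffer to the device
```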

@itsnarsi I am also unsure when the data copy occurs. If you are constantly copying data to the GPU, the timings may be skewed because of that.
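For example, copying the whole dataset once and slicing batches on the device avoids paying a host-to-device transfer per batch (a sketch with made-up shapes, not code from the notebook):

```python
import numpy as np
import arrayfire as af

data_np = np.random.rand(60000, 784).astype(np.float32)
bs = 64

# One host-to-device copy for the whole dataset...
data_af = af.np_to_af_array(data_np)

for i in range(data_af.dims()[0] // bs):
    # ...then device-side slices per batch, with no extra host copies.
    batch = data_af[i * bs:(i + 1) * bs, :]
    # forward/backward pass on `batch` goes here
```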

@itsnarsi I sent you a personal message on Gitter; can you reply to me there? :)