benanne / nervana_theano

A rudimentary wrapper around the fast Maxwell kernels for GEMM and convolution operations provided by nervanagpu

Benchmark

pranv opened this issue

Hey, thank you for this. Any comparative timed results?

The conclusion is that we need a C interface to their code to get a good speed-up. Otherwise, the Python/C interface overhead kills the speed-up.

@abergeron created a C interface to their gemm, before they made one. It wasn't easy and is hard to maintain. So we preferred not to spend more of our dev time on wrapping their conv kernels.

If someone makes a C interface to their conv kernels, we could integrate it into Theano. But otherwise, I don't plan to work more on this.
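(For context, a rough sketch of what driving their GEMM from Python looks like, following the usage pattern in the nervanagpu README; the names `NervanaGPU`, `ng.array`, `ng.dot` and `.get()` are assumed from that README and the exact signatures may differ. Every such call goes through the Python layer, which is the per-op overhead a C interface would avoid.)

```python
import numpy as np
import pycuda.autoinit  # nervanagpu builds on PyCUDA for context/memory management
from nervanagpu import NervanaGPU

ng = NervanaGPU()  # assumed backend entry point, as in the nervanagpu README

m, n, k = 1024, 1024, 1024

# host matrices
cpuA = np.random.randn(m, k).astype(np.float32)
cpuB = np.random.randn(k, n).astype(np.float32)

# transfer to device and allocate the output
devA = ng.array(cpuA, dtype=np.float32)
devB = ng.array(cpuB, dtype=np.float32)
devC = ng.empty((m, n), dtype=np.float32)

# one GEMM call; each call like this crosses the Python/C boundary,
# which is the overhead a C-level wrapper of the kernels avoids
ng.dot(devA, devB, devC)

cpuC = devC.get()  # copy the result back to the host
```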

Also, cudnn v2 narrowed the speed difference, and cudnn v3 will be faster than v2. So I don't know how much speed-up the nervana conv kernels will still give compared to v3.

If you have questions/comments, don't hesitate.

cudnn v3 is supposed to have Maxwell optimizations, so I'm sure they'll catch up to the Nervana kernels quite a bit. But knowing the madman that Scott Gray is, I doubt they'll be able to match the performance of his work (which is funny considering they make the hardware :p).

Curious to see how big the gap still is with cudnn v3. It was supposed to be released by now, I wonder what's keeping them!

@nouiz if cudnn v3 will give almost similar performance, would time spent on nervana conv kernels be valuable?

I have no benchmarks, so I can't tell if this is useful.

If someone is interested in making a C interface for those kernels, that is
great. But maybe make new speed benchmarks when cudnn v3 is released, which
should be soon. If you find someone who does it, please share the results.
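(As an illustration of what such a benchmark could look like, a minimal timing sketch for a single convolution forward pass in Theano; the shapes are made up, and it assumes a GPU-enabled Theano with cudnn installed so that conv2d is lowered to the cudnn implementation. Timing a nervana-backed op the same way would give the comparison asked for above.)

```python
import time
import numpy as np
import theano
import theano.tensor as T

# made-up shapes: batch 64, 32 input channels, 32x32 images, 64 filters of 5x5
x = T.ftensor4('x')
w = T.ftensor4('w')
y = T.nnet.conv2d(x, w)  # with device=gpu and cudnn available, Theano uses the cudnn conv
f = theano.function([x, w], y)

x_val = np.random.randn(64, 32, 32, 32).astype('float32')
w_val = np.random.randn(64, 32, 5, 5).astype('float32')

f(x_val, w_val)  # warm-up (first call includes compilation overhead)

n_iter = 50
t0 = time.time()
for _ in range(n_iter):
    f(x_val, w_val)  # includes the device->host result transfer; fine for a rough comparison
print('average forward pass: %.3f ms' % ((time.time() - t0) / n_iter * 1000))
```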

Also, nervana does not use the same memory layout for convolution. Does one of
you know what it is?

they use batch-size-last order (i.e. c01b, like cuda-convnet). Scott Gray once explained to me why this is inherently better for convolutions, but I don't remember what the reason was.
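(For the record, a small sketch of the two orderings and how to shuffle between them, in plain numpy; in a Theano graph the equivalent would be a dimshuffle.)

```python
import numpy as np

# Theano's default convolution layout is bc01: (batch, channels, rows, cols)
x_bc01 = np.zeros((128, 3, 32, 32), dtype='float32')

# cuda-convnet / nervana kernels expect c01b: (channels, rows, cols, batch),
# i.e. batch-size-last
x_c01b = x_bc01.transpose(1, 2, 3, 0)   # bc01 -> c01b
x_back = x_c01b.transpose(3, 0, 1, 2)   # c01b -> bc01

assert x_back.shape == x_bc01.shape == (128, 3, 32, 32)
print(x_c01b.shape)  # (3, 32, 32, 128)
```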