MNIST test fails to run on Radeon 8750M

Question

battlesnake opened this issue 9 years ago

Radeon 8750M on HP Probook 470 G1

clinfo:

Number of platforms                               1
  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 MESA 11.0.4
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             MESA

  Platform Name                                   Clover
Number of devices                                 1
  Device Name                                     AMD OLAND (DRM 2.42.0, LLVM 3.7.0)
  Device Vendor                                   AMD
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.1 MESA 11.0.4
  Driver Version                                  11.0.4
  Device OpenCL C Version                         OpenCL C 1.1 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE

Error:

initializing clblas
cl/activate.cl build log: 
input.cl:28:23: warning: implicit declaration of function 'tanh' is invalid in C99
input.cl:11:42: note: expanded from macro 'ACTIVATION_FUNCTION'
unsupported call to function tanh in activate

Hugh Perkins commented 9 years ago

:-)

Mark Cowan · Answer 1 · Mon Nov 16 2015 20:31:31 GMT+0800 (China Standard Time)

I'm guessing that my OpenCL compiler/hardware doesn't support tanh

Hugh Perkins · Answer 2 · Mon Nov 16 2015 20:45:29 GMT+0800 (China Standard Time)

Yes.... that's two of you now... but looking at http://www.notebookcheck.net/AMD-Radeon-HD-8750M.87147.0.html seems this card supports OpenCL 1.2?

Hugh Perkins · Answer 3 · Mon Nov 16 2015 20:53:03 GMT+0800 (China Standard Time)

(Had a browse through the AMD drivers download page, and came up with http://support.amd.com/en-us/download/desktop?os=Linux+x86_64 , but it didn't say what version of OpenCL these support... might be worth a shot though?)

(Edit: ah, per http://wiki.cchtml.com/index.php/Hardware , it seems like maybe this driver doesn't support the HD 8750M? Kind of hard to tell...)

Hugh Perkins · Answer 4 · Mon Nov 16 2015 21:06:54 GMT+0800 (China Standard Time)

(Note that I'm pretty sure OpenCL 1.1 itself solidly supports tanh: https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/tan.html )

Mark Cowan · Answer 5 · Mon Nov 16 2015 21:12:15 GMT+0800 (China Standard Time)

I'm running the open-source driver, which could be why it is at v1.1

Hugh Perkins · Answer 6 · Mon Nov 16 2015 21:24:18 GMT+0800 (China Standard Time)

Ok. Maybe you can add in tanh :-) I think you can construct it as ( e^x - e^(-x) ) / ( e^x + e^(-x) ). Or... you could hack the DeepCL code to use this. In cl/activate.cl, simply replace tanh(output) with this expression.

Mark Cowan · Answer 7 · Mon Nov 16 2015 21:45:39 GMT+0800 (China Standard Time)

I know :) I'm just curious as to why the OpenCL compiler can't already do that - maybe I should submit a bug report for mesa

Hugh Perkins · Answer 8 · Tue Nov 17 2015 07:19:52 GMT+0800 (China Standard Time)

Well... I guess you could modify their code. But... I'm curious how this could be running on your GPU. It seems like Clover is compiling the OpenCL itself, and I doubt it's compiling into AMD ISA? (i.e. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf and so on). So, it seems like it might be running in software, on your CPU?

Mark Cowan · Answer 9 · Tue Nov 17 2015 07:45:54 GMT+0800 (China Standard Time)

Could be, although I'd be amazed if my Haswell + Intel's drivers couldn't handle tanh!

Hugh Perkins · Answer 10 · Tue Nov 17 2015 13:10:00 GMT+0800 (China Standard Time)

So, what I think is, Clover runs your OpenCL code, just like a script, like running something in Python and so on. I'm fairly sure that no OpenCL code is actually sent to your CPU, to run as Intel OpenCL.

OpenCL => Clover OpenCL Scripting engine => executes as normal x86 program, inside Clover

(Edit: I should have just used the 'Edit' button really :-P Not sure why I didn't.... )

(Edit2: but I'm not really sure what Clover is doing. Scripting would sound too slow. But compiling to x86 on the fly would sound strange too. But compiling to AMD ISA seems unlikely. Soo... ????)

Mark Cowan · Answer 11 · Tue Nov 17 2015 21:03:34 GMT+0800 (China Standard Time)

You're right, Clover runs on CPU. Given this, I'm amazed that it doesn't support tanh...

Hugh Perkins · Answer 12 · Mon Jan 11 2016 18:28:20 GMT+0800 (China Standard Time)

Hi. Apparently I'm wrong. Clover does run on GPU :-) See Element-Research/rnn#41

Hugh Perkins · Answer 13 · Mon Jan 11 2016 18:33:21 GMT+0800 (China Standard Time)

So, what I'd suggest is, creating a fork/branch, and modifying activate.cl to write tanh in terms of exp. The expression for tanh in terms of exp is something like: (exp(y) - exp(-y)) / (exp(y) + exp(-y))

Hugh Perkins · Answer 14 · Mon Jan 11 2016 18:36:49 GMT+0800 (China Standard Time)

Note: you'd need to modify these lines basically: https://github.com/hughperkins/DeepCL/blob/master/cl/activate.cl#L10-L13

#ifdef TANH
    #define ACTIVATION_FUNCTION(output) (tanh(output))
#elif defined SCALEDTANH
    #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))

I guess it will need to look something like (just doing it for TANH for now):

#ifdef TANH
    #define ACTIVATION_FUNCTION(output) ((exp(output) - exp(-(output))) / (exp(output) + exp(-(output))))
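
As a quick numerical sanity check of the substitution above, here is a minimal Python sketch. It is not kernel code, and the function names `tanh_via_exp`, `tanh_stable`, and `scaled_tanh_stable` are made up for illustration. It compares the exp-based expression against the library tanh, and also shows an algebraically equivalent form built from exp(-2|x|), which stays in (0, 1] and so cannot overflow. That second form is a suggestion under the assumption that large inputs may occur; it is not taken from DeepCL.

```python
import math


def tanh_via_exp(x: float) -> float:
    """Direct transcription of (exp(x) - exp(-x)) / (exp(x) + exp(-x))."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def tanh_stable(x: float) -> float:
    """Same function, rewritten so the only exponential is exp(-2|x|).

    Dividing numerator and denominator of the direct form by exp(|x|)
    gives (1 - t) / (1 + t) with t = exp(-2|x|); the sign of x is
    restored afterwards.
    """
    t = math.exp(-2.0 * abs(x))
    return math.copysign((1.0 - t) / (1.0 + t), x)


def scaled_tanh_stable(x: float) -> float:
    """SCALEDTANH variant from activate.cl: 1.7159 * tanh(0.66667 * x)."""
    return 1.7159 * tanh_stable(0.66667 * x)


if __name__ == "__main__":
    # Print both forms next to math.tanh for a few moderate inputs.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(x, tanh_via_exp(x), tanh_stable(x), math.tanh(x))
```

In Python the direct form raises OverflowError once exp(x) exceeds the double range. In single-precision OpenCL, exp(x) overflows for x above roughly 88.7, so the direct expression would be expected to evaluate to inf/inf there. Whether that matters depends on how large the values fed to the activation get.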

William Bernoudy · Answer 15 · Mon Mar 28 2016 16:49:17 GMT+0800 (China Standard Time)

I've been having the same issue with tanh on my system (R9 290 but with OpenCL 1.1). I would like to start messing around with what you've suggested hughperkins. However, editing the cl/activate.cl seems to do absolutely nothing, after make or after running the test Python script. After running the Python script, I still get:

Something went wrong with clCreateKernel, OpenCL erorr code -45
cl/activate.cl build log: 
input.cl:30:23: warning: implicit declaration of function 'tanh' is invalid in C99
input.cl:11:42: note: expanded from macro 'ACTIVATION_FUNCTION'
unsupported call to function tanh in activate
Traceback (most recent call last):
  File "test_deepcl.py", line 35, in <module>
    net, "rt2-8c5z-relu-mp2-16c5z-relu-mp3-150n-tanh-10n")
  File "NetDefToNet.pyx", line 7, in PyDeepCL.NetdefToNet.createNetFromNetdef (PyDeepCL.cxx:15006)
RuntimeError: 
kernel source:

It then repeats the original source of cl/activate.cl without any of my changes. I'm sure I'm missing something simple, but do you mind pointing out what I'm missing? How do I actually test my changes to cl/activate.cl?

Hugh Perkins · Answer 16 · Mon Mar 28 2016 17:09:55 GMT+0800 (China Standard Time)

You need to do one of two things, either:

1. obtain an OpenCL driver for your GPU that implements the tanh function, or
2. modify DeepCL to write tanh using the exp function, which is fairly straightforward, see #35 (comment)

William Bernoudy · Answer 17 · Mon Mar 28 2016 17:16:50 GMT+0800 (China Standard Time)

Thanks for the quick reply! I'm trying to choose option 2 and modify DeepCL.

However, after I've modified cl/activate.cl by using the exp and compiled, my changes don't seem to matter. I get the same error and the build log error shows the previous source code. Does that make sense? What do I need to do to compile the changes I've made to cl/activate.cl?

Hugh Perkins · Answer 18 · Mon Mar 28 2016 17:29:17 GMT+0800 (China Standard Time)

You'll need to turn on cog during compilation. Basically, cd into build directory, run ccmake .., and set the configuration options something like:

You won't see the COG option initially, but if you set MAINTAINER_OPTIONS to ON, and press c, it will appear :-)

I'd recommend you start by getting the non-python version working first, since it involves fewer compilation steps. You can test by running the unit tests ./deepcl_unittests, creating a new one if necessary, but I think you can use:

./deepcl_unittests tests=testactivationforward.comparespecific_0_1_activation3_small2_tanh

William Bernoudy · Answer 19 · Mon Mar 28 2016 18:00:42 GMT+0800 (China Standard Time)

Great! Took a little more troubleshooting but compiling is working and my changes are showing up.

Thanks again for the quick response and the wonderful project.

Hugh Perkins · Answer 20 · Mon Mar 28 2016 18:26:27 GMT+0800 (China Standard Time)

Awesome! Once you have that working, do you mind creating a fork/branch, so other people can use it too? (I'll probably take a copy of the fork too; and might ponder if there's a way of adding it into master somehow)

William Bernoudy · Answer 21 · Mon Mar 28 2016 18:38:50 GMT+0800 (China Standard Time)

I'm happy to do so. However, a few things to consider:

1. I basically just added one line to cl/activate.cl consisting of your suggestion. This fixed the missing tanh problem, and I was finally able to pass the testactivationforward.comparespecific_0_1_activation3_small2_tanh unit test.
2. However, I was still failing 49 of the unit tests (before the fix I was failing 50). This seems to be due to another error under OpenCL 1.1: "OpenCL does not support the 'static' storage class specifier".
3. I randomly just got OpenCL 1.2 to work and now I am passing 100% of the unit tests.

Do you still think it's worth it for just this fix? If OpenCL 1.1 is a priority, it seems we might want to tackle the other issue as well...

Hugh Perkins · Answer 22 · Mon Mar 28 2016 18:52:47 GMT+0800 (China Standard Time)

Ah, so finally you've switched to OpenCL 1.2 for now? Concretely, this means you are using the AMD drivers, rather than the Clover drivers, is that right?

Hugh Perkins · Answer 23 · Mon Mar 28 2016 19:52:25 GMT+0800 (China Standard Time)

By the way, I don't think either of the issues (missing tanh and missing static support) are actually OpenCL 1.1 issues; I think these are both specific to the Clover implementation of OpenCL 1.1. Having said this, maybe we can create a fork like 'clover-compatibility', where we handle these two issues? You can start by creating this from your change '1.', and then someone can revert my static additions, to handle the Clover issue with static support (probably just revert some of 022b6a3 approximately, or just go through the cl/*.cl files, and remove the static keyword)
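
The "go through the cl/*.cl files" step could be scripted roughly as follows. This is only a sketch: `strip_static` and `rewrite_cl_files` are hypothetical helper names, not part of DeepCL, and the regex is a plain text substitution that would also hit the word `static` inside comments or string literals, so the resulting diff should be reviewed by hand.

```python
import re
from pathlib import Path
from typing import List

# Matches the standalone keyword `static` plus any whitespace after it.
# Identifiers such as `static_assert` or `nonstatic` are left alone
# because of the word boundaries.
_STATIC_KEYWORD = re.compile(r"\bstatic\b\s*")


def strip_static(source: str) -> str:
    """Return kernel source text with every standalone `static` removed."""
    return _STATIC_KEYWORD.sub("", source)


def rewrite_cl_files(cl_dir: str) -> List[str]:
    """Strip `static` from each *.cl file in cl_dir, in place.

    Returns the names of the files that were actually changed.
    """
    changed = []
    for path in sorted(Path(cl_dir).glob("*.cl")):
        original = path.read_text()
        updated = strip_static(original)
        if updated != original:
            path.write_text(updated)
            changed.append(path.name)
    return changed
```

As noted earlier in the thread, edits to the .cl files only reach the built binaries when cog is enabled in the build, so a rebuild with that option on would still be needed afterwards.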

William Bernoudy · Answer 24 · Mon Mar 28 2016 20:20:55 GMT+0800 (China Standard Time)

Not exactly. Before I was using the Clover drivers for OpenCL 1.1. I was then trying to upgrade OpenCL and somehow managed to get the Intel driver for OpenCL 1.2. So all the tests were successful using the Intel driver and my i5.

I now finally got the AMD drivers working for my R9 290 with OpenCL 2.0. That is also passing all the unit tests.

So yes, it seems like it is just Clover. I created the PR: #60
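
For anyone wanting to check which implementation they are on before trying DeepCL, the platform fields can be pulled out of `clinfo` output with something like the sketch below. It assumes the two-column layout shown at the top of this thread (key, two or more spaces, value) and only keeps the first value seen per key, so on a machine with several platforms it reports the first one. The helper names `parse_clinfo_fields` and `is_clover` are made up for illustration.

```python
import re
from typing import Dict


def parse_clinfo_fields(text: str) -> Dict[str, str]:
    """Collect 'key   value' pairs from clinfo output.

    Keys and values are assumed to be separated by at least two spaces.
    Only the first value seen for each key is kept.
    """
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
        if len(parts) == 2 and parts[0] not in fields:
            fields[parts[0]] = parts[1]
    return fields


def is_clover(fields: Dict[str, str]) -> bool:
    """True when the (first) platform reports itself as Mesa's Clover."""
    return fields.get("Platform Name") == "Clover"


# Excerpt of the clinfo output posted at the top of this thread.
SAMPLE = """\
Number of platforms                               1
  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 MESA 11.0.4
"""

if __name__ == "__main__":
    info = parse_clinfo_fields(SAMPLE)
    print(info["Platform Name"], "|", info["Platform Version"])
    print("Clover:", is_clover(info))
```

To run it against a live system, the `SAMPLE` text could be replaced by the captured standard output of the `clinfo` command.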

Hugh Perkins · Answer 25 · Tue Mar 29 2016 18:26:59 GMT+0800 (China Standard Time)

For anyone else coming across this thread, please note that the clover compatibility fork is at: https://github.com/hughperkins/DeepCL/tree/clover-compatibility (or https://github.com/rhyzomatic/DeepCL/tree/clover-compatibility , depending on how you look at it)

Hugh Perkins · Answer 26 · Sat May 21 2016 17:34:46 GMT+0800 (China Standard Time)

Added notes on this to README.md (8ac7b0f). Closing this issue for now.