hughperkins / clnn

OpenCL backend for Torch nn neural networks library

[FeatureRequest] port SpatialUpSamplingNearest and SpatialBatchNormalization from cunn

pawni opened this issue · comments

I tried to rebuild something like http://tinyclouds.org/colorize/ and found that SpatialUpSamplingNearest and SpatialBatchNormalization are not available within clnn. I tried porting UpSampling yesterday (pawni@091882d ). It seems to work, but I haven't tested it properly yet. I'll keep looking at that and at BatchNormalization; any input would be much appreciated. :)

Cool. Sounds good :-) Looks like you've already ported it to THNN too? Let me know when you want me to merge.

Yes I did, since I didn't know how to avoid putting it into THNN and still be able to use it ;) Is there anything I should keep in mind when porting these things, or is the current port okay as it is?

Ah... hmm... ah :-) Yes, there is a breaking change coming up. Look at torch/nn#601 to see what's going to happen. To see the effect:

git clone https://github.com/fmassa/nn.git
cd nn
git checkout THNN_fmassa_2
luarocks make rocks/nn-scm-1.rockspec

Now, try re-running your tests :-)

Hi, the THNN changes have been merged now: torch/nn#601 (comment)

Hi, I saw that and tested it. I think THNN isn't the problem, but I get some weird values which I need to fix. I need to check, when I have some time, whether it is the CL code or somewhere in the C++ that my values are getting corrupted. It seems like I am running over the limits of the data types, but I'm not sure.

Ok. Just in case it's useful, what I normally do for testing kernels is:

  • add some C/C++ code to print out the result of calling a kernel
  • modify the kernel to not update the output tensors
  • add an if clause, for one single kernel thread, and use that to write out a bunch of debug/diagnostic info, like:
kernel void something(global float *out, global float *foo, global float *b /* , ... */) {
  // stuff here

  // out[get_global_id(0)] = foo[get_global_id(0)]; // normal output write, commented out

  if (get_global_id(0) == 0) { // only one thread enters this, ever
    out[0] = 123; // some visible value, to check anything is happening at all
    out[1] = b[0]; // find out what is in b[0]
    // etc ...
  }
}

I tried adding your debug code, but I can't even see the 123 in the output. Furthermore, it mostly does not change the output, but sometimes it seems to put random values into it. Do you have any idea why that might happen?
I made the changes here: https://github.com/pawni/clnn/blob/master/lib/THCLNN/SpatialUpSamplingNearest.cpp
There is also a test script testing the module, similar to the other ones.

(Accidentally clicked into this. Please paste a new dummy message, so it stays in my notifications. I will check after work / at the weekend.)

(push)

Oh, it should be k.out(output).

Also, you'll need to synchronize, and copy the data back from the GPU to the CPU.

To synchronize is something like:

    cl->finish()

To copy back from the GPU to the CPU is something like:

THClTensor_wrapper(state, output)->copyToHost();
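
Putting those together, the debug sequence on the host side, right after the kernel launch, is something like this. Just a sketch: finish() and copyToHost() are the calls above, but the launch method and the storage access I'm writing from memory, so they might be a bit different:

k.run_1d(grid, block);  // launch the kernel (exact run method from memory, may differ)
cl->finish();  // wait for the kernel to actually complete
THClTensor_wrapper(state, output)->copyToHost();  // copy the GPU buffer back to host memory
float *data = output->storage->data;  // assumed: host-side storage now holds the values
std::cout << "out[0]=" << data[0] << " out[1]=" << data[1] << std::endl;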

(It might be that changing from k.in to k.out, for the output, is sufficient to solve the issue. But usually custom kernels require a fair amount of debugging. Depends...)

(Just to be clear, you don't need to synchronize or copy data back normally, just if you are debugging. Actually, since you are using cout to print, it might handle that automatically, I'm not sure.)

(As for the difference between k.in and k.out: basically, k.in will copy data from main memory to GPU memory before running the kernel, and then do nothing afterwards, as far as that particular value or tensor is concerned. k.out will leave the GPU-side buffer as created, not initialized in any particular way, and certainly not with the contents of the CPU-side tensor copied into it. After running the kernel, the data will be copied from the GPU back into main memory. It's been a while, but I think the way it does this is by setting a 'dirty' flag, so it won't actually copy anything explicitly; the copy takes place when you try to read the data, as long as you read it through the tensor API methods. There's also inout, which transfers in both directions. In the back of my mind, I think that inout behaves the same as one of in or out in practice, but I can't remember which, so just use the one that matches what should be happening.)
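
So a typical host-side launch is something like this (from memory, so treat the exact method names as approximate):

THClKernels k(state, kernel);
k.in(input);    // input tensor: copied from main memory to the GPU before the kernel runs
k.out(output);  // output tensor: GPU-side buffer the kernel writes; marked dirty, and
                // copied back lazily when you read it through the tensor API
k.in((int)no_elements);  // scalar arguments go in by value, as ints
k.run_1d(grid, block);   // launch (exact run method from memory, may differ)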

That made it an easy fix - I just changed the out and added the finish() and copyToHost() and it seems to work now. Thanks a lot! Shall I create a PR for that?

> I just changed the out and added the finish() and copyToHost() and it seems to work now.

Ok great!

Actually, the finish and copyToHost are only needed for debugging (so you can see the output, and even then I'm not sure they're needed, if you use cout, as you do). You should be able to remove them now, and it should continue to work. Hopefully :-)

Just changed it and it is still working as it should - thanks! :)
Do you want me to PR?

Yes, please create a PR.

Well... I get the following error messages when I run the tests:

SpatialUpSamplingNearest_backward
 Function call failed 
/home/ubuntu/torch/install/share/lua/5.1/nn/THNN.lua:1045: OpenCL error, code: CL_INVALID_ARG_SIZE at /home/ubuntu/git/cltorch/src/lib/THClKernels.cpp:27
stack traceback:
    [C]: in function 'v'
    /home/ubuntu/torch/install/share/lua/5.1/nn/THNN.lua:1045: in function 'SpatialUpSamplingNearest_updateOutput'
    ...ch/install/share/lua/5.1/nn/SpatialUpSamplingNearest.lua:50: in function 'forward'
    ...tall/share/lua/5.1/clnn/testSpatialUpSamplingNearest.lua:123: in function 'v'
    /home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2619: in function </home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2617>
    [C]: in function 'xpcall'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:186: in function '_run'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:161: in function 'run'
    /home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2658: in function 'test'
    (command line):1: in main chunk
    [C]: at 0x00406670

--------------------------------------------------------------------------------
SpatialUpSamplingNearest_forward_batch
 Function call failed 
/home/ubuntu/torch/install/share/lua/5.1/nn/THNN.lua:1045: OpenCL error, code: CL_INVALID_ARG_SIZE at /home/ubuntu/git/cltorch/src/lib/THClKernels.cpp:27
stack traceback:
    [C]: in function 'v'
    /home/ubuntu/torch/install/share/lua/5.1/nn/THNN.lua:1045: in function 'SpatialUpSamplingNearest_updateOutput'
    ...ch/install/share/lua/5.1/nn/SpatialUpSamplingNearest.lua:50: in function 'forward'
    ...tall/share/lua/5.1/clnn/testSpatialUpSamplingNearest.lua:36: in function 'v'
    /home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2619: in function </home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2617>
    [C]: in function 'xpcall'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:186: in function '_run'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:161: in function 'run'
    /home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2658: in function 'test'
    (command line):1: in main chunk
    [C]: at 0x00406670

Thoughts?

Probably no_elements should be an int, to match what you are feeding to k.in, and because pretty much everything in clnn is ints for now (no-one has ever complained about this, as far as I remember, so ints they stay for now :-) )

Weird - it runs fine for me.
What are the versions of your other libraries / environment, so that I can try to replicate?
But I also changed the long to int (I had just kept it from the conversion ;) ), so you could try again.

Ok. Seems like there is a bunch of stuff missing from my CMakeLists.txt too, so cog isn't running. All that stuff in quotation marks at the bottom of the .cpp file should be generated automatically by cog.
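
For context, the generated bit at the bottom of each .cpp file looks something like this (illustrative only; the exact markers and helper names might differ a little, I'm going from memory):

// [[[cog
// import stringify
// stringify.write_kernel("kernel", "SpatialUpSamplingNearest.cl")
// ]]]
// generated using cog, from SpatialUpSamplingNearest.cl:
const char * kernelSource =
"kernel void upscale(global float *input, global float *output /* ... */) {\n"
"  // ... kernel body, copied verbatim from the .cl file ...\n"
"}\n";
// [[[end]]]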

Hmmm, mine still fails actually. I'll push where I am to a branch.

Here are the changes I've made: https://github.com/pawni/clnn/compare/master...hughperkins:pawni-master?expand=1 You should be able to run cog to update the bits in quotation marks at the bottom of the .cpp files by doing:

  • cd build
  • ccmake ..
  • flick 'DEV_RUN_COG' to 'ON' (you'll need python in the path)
  • 'c' for configure, 'g' for generate
  • cd ..
  • luarocks make clnn-scm-1.rockspec => should now copy the contents of the .cl files into the .cpp files, whenever the .cl files are updated

> Weird - it runs fine for me.
> What are the versions of your other libraries / environment, so that I can try to replicate?
> But I also changed the long to int (I had just kept it from the conversion ;) ), so you could try again.

Well, the specific error CL_INVALID_ARG_SIZE normally means there is a mismatch between the arguments you are passing via k.in and the arguments expected by the kernel in the .cl file. It could plausibly be GPU-specific, but it shouldn't depend on other libraries and such, I think.
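
Concretely, the long/int thing works something like this (a hypothetical snippet, just to show the mechanism; the helper name is a guess):

// the .cl kernel declares a 4-byte int parameter:
//   kernel void upscale(global float *in, global float *out, int no_elements)
// if the host-side variable is a long (8 bytes on 64-bit), the argument size
// handed to clSetKernelArg doesn't match the kernel => CL_INVALID_ARG_SIZE
long no_elements_long = THClTensor_nElement(state, output);  // assumed helper name
int no_elements = (int)no_elements_long;  // the fix: make it a 4-byte int
k.in(no_elements);  // now matches the kernel's int parameter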

I'm using an NVIDIA 940M, on Ubuntu 15.10 64-bit.

Oh wait, changing from long to int fixed forward, but backward is still broken. I will do the same thing in backward.

Ok, passing now. I've merged to master: https://github.com/hughperkins/clnn/commits/master

awesome, thanks!

Thank you very much Nick

Thank you for the support! I hope I'll also find time to get my head around the batch normalisation, but that looks a bit harder than the upsampling - I'll keep you posted about it.

Ok, cool. Sounds good :-) Batch normalization would be very useful :-)

Oh... SpatialBatchNormalization does look quite challenging... The good news is: no Thrust. Thrust is basically not directly portable; it needs a bunch of creativity.

SpatialBatchNormalization contains a zillion kernels, but they should all be fairly directly portable. Probably a bunch of work though. All those bits with triple angle brackets are CUDA kernel launches, e.g.:

SpatialBatchNormalizationBackward_kernel<8>
      <<<blocks, threads, 0, s>>>
      (input, gradOutput, gradInput, gradWeight, gradBias, weight,
       saveMean, saveStd, scale);

The kernel itself is higher up in the same file, and it's a template, with parameters.

OpenCL itself doesn't handle templates. OpenCL is essentially C99, whereas templates are from C++. What I'm doing is using Lua as a templating language.

You can see an example by looking at the CUDA implementation of THCApply.cuh, in cutorch, and comparing it to the OpenCL version, ie compare https://github.com/torch/cutorch/blob/master/lib/THC/THCApply.cuh with https://github.com/hughperkins/cltorch/blob/master/src/lib/THClApply.cpp and https://github.com/hughperkins/cltorch/blob/master/src/lib/THClApply.cl

The stuff with curly brackets, {% ... %}, is templating in the style of Jinja2, but using {% ... %} instead of {{ ... }}.
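
Schematically, something like the <8> template parameter above turns into a Lua loop in the .cl source, which stamps out one concrete kernel per value before the source is compiled. Purely illustrative; the real loop and substitution syntax is in THClApply.cl, linked above:

{% for _, numThreads in ipairs({4, 8, 16}) do %}
kernel void SpatialBatchNormalizationBackward_{%=numThreads%}(
    global const float *input,
    global const float *gradOutput,
    global float *gradInput /* , ... */) {
  // the body uses the loop variable wherever the CUDA code
  // used the template parameter <8>
}
{% end %}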

Just noticed that SpatialBatchNormalization used to be working, and was broken in torch/nn#665 and an earlier commit. Fixed it now. Should be working again now?

Hi, I decided to take the radical step of rolling back torch and cltorch to around 21 February, prior to a bunch of the THNN changes, which were causing entropy at a rate faster than I could handle. This has the downside that that's just slightly before your PR for SpatialUpSamplingNearest, so you might need to resubmit that PR please.

Should all be working again now, I think?