hughperkins / clnn

OpenCL backend for Torch nn neural networks library

[FeatureRequest] port SpatialUpSamplingNearest and SpatialBatchNormalization from cunn

pawni opened this issue · comments

I tried to rebuild something like http://tinyclouds.org/colorize/ and found that SpatialUpSamplingNearest and SpatialBatchNormalization are not available within clnn. I tried porting UpSampling yesterday (pawni@091882d ). It seems to work, but I haven't tested it properly yet. I'll keep looking at that and at BatchNormalization; any input would be much appreciated. :)

Cool. Sounds good :-) Looks like you've already ported it to THNN too? Let me know when you want me to merge.

Yes I did, since I didn't know how to avoid putting it into THNN and still be able to use it ;) Is there anything I should keep in mind when porting these things, or is the current port okay as it is?

Ah... hmm... ah :-) Yes, there is a breaking change coming up. Look at torch/nn#601 to see what's going to happen. To see the effect:

git clone https://github.com/fmassa/nn.git
cd nn
git checkout THNN_fmassa_2
luarocks make rocks/nn-scm-1.rockspec

Now, try re-running your tests :-)

Hi, the THNN changes have been merged now: torch/nn#601 (comment)

Hi, I saw that and tested it. I think THNN isn't the problem, but I get some weird values which I need to fix. I need to check, when I have some time, whether it is the CL code or somewhere in the C++ that my values are getting corrupted. It seems like I am running over the limits of the data types, but I'm not sure.

Ok. Just in case it's useful, what I normally do for testing kernels is:

  • add some C/C++ code to print out the result of calling a kernel
  • modify the kernel to not update the output tensors
  • add an if clause, for one single kernel thread, and use that to write out a bunch of debug/diagnostic info, like:
kernel void something(global float *out, global float *foo, global float *b /* , ... */) {
  // stuff here

  // out[get_global_id(0)] = foo[get_global_id(0)]; // normal output write, commented out

  if (get_global_id(0) == 0) { // only one thread enters this, ever
    out[0] = 123; // some visible value, to check anything is happening at all
    out[1] = b[0]; // find out what is in b[0]
    // etc ...
  }
}

I tried adding your debug code, but I can't even see the 123 in the output. Furthermore, it mostly does not change the output, but sometimes it seems to put random values into it. Do you have any idea why that might happen?
I made the changes here: https://github.com/pawni/clnn/blob/master/lib/THCLNN/SpatialUpSamplingNearest.cpp
There is also a test script testing the module, similar to the other ones.

(Accidentally clicked into this. Please paste a new dummy message, so it stays in my notifications. I will check after work / at the weekend.)

(push)

Oh, it should be k.out(output).

Also, you'll need to synchronize, and copy the data back from the GPU to the CPU.

To synchronize is something like:

    cl->finish()

To copy back from the GPU to the CPU is something like:

THClTensor_wrapper(state, output)->copyToHost();
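
Putting those together, the debug sequence on the host side, right after the kernel launch, is something like this. Just a sketch: finish() and copyToHost() are the calls above, but the launch method and the storage access I'm writing from memory, so they might be a bit different:

k.run_1d(grid, block);  // launch the kernel (exact run method from memory, may differ)
cl->finish();  // wait for the kernel to actually complete
THClTensor_wrapper(state, output)->copyToHost();  // copy the GPU buffer back to host memory
float *data = output->storage->data;  // assumed: host-side storage now holds the values
std::cout << "out[0]=" << data[0] << " out[1]=" << data[1] << std::endl;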

(It might be that changing from k.in to k.out, for the output, is sufficient to solve the issue. But usually custom kernels require a fair amount of debugging. Depends...)

(Just to be clear, you don't need to synchronize or copy data back normally, just if you are debugging. Actually, since you are using cout to print, it might handle that automatically, I'm not sure.)

(As for the difference between k.in and k.out: basically, k.in will copy data from main memory to GPU memory before running the kernel, and then do nothing afterwards, as far as that particular value or tensor is concerned. k.out will leave the GPU-side buffer as created, not initialized in any particular way, and certainly not with the contents of the CPU-side tensor copied into it. After running the kernel, the data will be copied from the GPU back into main memory. It's been a while, but I think the way it does this is by setting a 'dirty' flag, so it won't actually copy anything explicitly; the copy takes place when you try to read the data, as long as you read it through the tensor API methods. There's also inout, which transfers in both directions. In the back of my mind, I think that inout behaves the same as one of in or out in practice, but I can't remember which, so just use the one that matches what should be happening.)
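
So a typical host-side launch is something like this (from memory, so treat the exact method names as approximate):

THClKernels k(state, kernel);
k.in(input);    // input tensor: copied from main memory to the GPU before the kernel runs
k.out(output);  // output tensor: GPU-side buffer the kernel writes; marked dirty, and
                // copied back lazily when you read it through the tensor API
k.in((int)no_elements);  // scalar arguments go in by value, as ints
k.run_1d(grid, block);   // launch (exact run method from memory, may differ)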

That made it an easy fix - I just changed the out and added the finish() and copyToHost() and it seems to work now. Thanks a lot! Shall I create a PR for that?

> I just changed the out and added the finish() and copyToHost() and it seems to work now.

Ok great!

Actually, the finish and copyToHost are only needed for debugging (so you can see the output, and even then I'm not sure they're needed, if you use cout, as you do). You should be able to remove them now, and it should continue to work. Hopefully :-)

Just changed it and it is still working as it should - thanks! :)
Do you want me to PR?

Yes, please create a PR.

Well... I get the following error messages when I run the tests:

SpatialUpSamplingNearest_backward
 Function call failed 
/home/ubuntu/torch/install/share/lua/5.1/nn/THNN.lua:1045: OpenCL error, code: CL_INVALID_ARG_SIZE at /home/ubuntu/git/cltorch/src/lib/THClKernels.cpp:27
stack traceback:
    [C]: in function 'v'
    /home/ubuntu/torch/install/share/lua/5.1/nn/THNN.lua:1045: in function 'SpatialUpSamplingNearest_updateOutput'
    ...ch/install/share/lua/5.1/nn/SpatialUpSamplingNearest.lua:50: in function 'forward'
    ...tall/share/lua/5.1/clnn/testSpatialUpSamplingNearest.lua:123: in function 'v'
    /home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2619: in function </home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2617>
    [C]: in function 'xpcall'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:186: in function '_run'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:161: in function 'run'
    /home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2658: in function 'test'
    (command line):1: in main chunk
    [C]: at 0x00406670

--------------------------------------------------------------------------------
SpatialUpSamplingNearest_forward_batch
 Function call failed 
/home/ubuntu/torch/install/share/lua/5.1/nn/THNN.lua:1045: OpenCL error, code: CL_INVALID_ARG_SIZE at /home/ubuntu/git/cltorch/src/lib/THClKernels.cpp:27
stack traceback:
    [C]: in function 'v'
    /home/ubuntu/torch/install/share/lua/5.1/nn/THNN.lua:1045: in function 'SpatialUpSamplingNearest_updateOutput'
    ...ch/install/share/lua/5.1/nn/SpatialUpSamplingNearest.lua:50: in function 'forward'
    ...tall/share/lua/5.1/clnn/testSpatialUpSamplingNearest.lua:36: in function 'v'
    /home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2619: in function </home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2617>
    [C]: in function 'xpcall'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:186: in function '_run'
    /home/ubuntu/torch/install/share/lua/5.1/torch/Tester.lua:161: in function 'run'
    /home/ubuntu/torch/install/share/lua/5.1/clnn/test.lua:2658: in function 'test'
    (command line):1: in main chunk
    [C]: at 0x00406670

Thoughts?

Probably no_elements should be an int, to match what you are feeding to k.in, and because pretty much everything in clnn is ints for now (no-one has ever complained about this, as far as I remember, so ints they stay for now :-) )

Weird - it runs fine for me.
What are the versions of your other libraries / environment, so that I can try to replicate?
But I also changed the long to int (I had just kept it from the conversion ;) ), so you could try again.

Ok. Seems like there is a bunch of stuff missing from my CMakeLists.txt too, so cog isn't running. All that stuff in quotation marks at the bottom of the .cpp file should be generated automatically by cog.
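
For context, the generated bit at the bottom of each .cpp file looks something like this (illustrative only; the exact markers and helper names might differ a little, I'm going from memory):

// [[[cog
// import stringify
// stringify.write_kernel("kernel", "SpatialUpSamplingNearest.cl")
// ]]]
// generated using cog, from SpatialUpSamplingNearest.cl:
const char * kernelSource =
"kernel void upscale(global float *input, global float *output /* ... */) {\n"
"  // ... kernel body, copied verbatim from the .cl file ...\n"
"}\n";
// [[[end]]]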

Hmmm, mine still fails actually. I'll push where I am to a branch.

Here are the changes I've made: https://github.com/pawni/clnn/compare/master...hughperkins:pawni-master?expand=1 You should be able to run cog to update the bits in quotation marks at the bottom of the .cpp files by doing:

  • cd build
  • ccmake ..
  • flick 'DEV_RUN_COG' to 'ON' (you'll need python in the path)
  • 'c' for configure, 'g' for generate
  • cd ..
  • luarocks make clnn-scm-1.rockspec => should now copy the contents of the .cl files into the .cpp files, whenever the .cl files are updated

> Weird - it runs fine for me.
> What are the versions of your other libraries / environment, so that I can try to replicate?
> But I also changed the long to int (I had just kept it from the conversion ;) ), so you could try again.

Well, the specific error CL_INVALID_ARG_SIZE normally means there is a mismatch between the arguments you are passing via k.in and the arguments expected by the kernel in the .cl file. It could plausibly be GPU-specific, but it shouldn't depend on other libraries and such, I think.
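
Concretely, the long/int thing works something like this (a hypothetical snippet, just to show the mechanism; the helper name is a guess):

// the .cl kernel declares a 4-byte int parameter:
//   kernel void upscale(global float *in, global float *out, int no_elements)
// if the host-side variable is a long (8 bytes on 64-bit), the argument size
// handed to clSetKernelArg doesn't match the kernel => CL_INVALID_ARG_SIZE
long no_elements_long = THClTensor_nElement(state, output);  // assumed helper name
int no_elements = (int)no_elements_long;  // the fix: make it a 4-byte int
k.in(no_elements);  // now matches the kernel's int parameter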

I'm using an NVIDIA 940M, on Ubuntu 15.10 64-bit.

Oh wait, changing from long to int fixed forward, but backward is still broken. I will do the same thing in backward.

Ok, passing now. I've merged to master: https://github.com/hughperkins/clnn/commits/master

awesome, thanks!

Thank you very much Nick

Thank you for the support! I hope I'll also find time to get my head around the batch normalisation, but that looks a bit harder than the upsampling - I'll keep you posted about it.

Ok, cool. Sounds good :-) Batch normalization would be very useful :-)

Oh... SpatialBatchNormalization does look quite challenging... The good news is: no Thrust. Thrust is basically not directly portable; it needs a bunch of creativity.

SpatialBatchNormalization contains a zillion kernels, but they should all be fairly directly portable. Probably a bunch of work though. All those bits with triple angle brackets are CUDA kernel launches, e.g.:

SpatialBatchNormalizationBackward_kernel<8>
      <<<blocks, threads, 0, s>>>
      (input, gradOutput, gradInput, gradWeight, gradBias, weight,
       saveMean, saveStd, scale);

The kernel itself is higher up in the same file, and it's a template, with parameters.

OpenCL itself doesn't handle templates. OpenCL is essentially C99, whereas templates are from C++. What I'm doing is using Lua as a templating language.

You can see an example by looking at the CUDA implementation of THCApply.cuh, in cutorch, and comparing it to the OpenCL version, ie compare https://github.com/torch/cutorch/blob/master/lib/THC/THCApply.cuh with https://github.com/hughperkins/cltorch/blob/master/src/lib/THClApply.cpp and https://github.com/hughperkins/cltorch/blob/master/src/lib/THClApply.cl

The stuff with curly brackets, {% ... %}, is templating in the style of Jinja2, but using {% ... %} instead of {{ ... }}.
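
Schematically, something like the <8> template parameter above turns into a Lua loop in the .cl source, which stamps out one concrete kernel per value before the source is compiled. Purely illustrative; the real loop and substitution syntax is in THClApply.cl, linked above:

{% for _, numThreads in ipairs({4, 8, 16}) do %}
kernel void SpatialBatchNormalizationBackward_{%=numThreads%}(
    global const float *input,
    global const float *gradOutput,
    global float *gradInput /* , ... */) {
  // the body uses the loop variable wherever the CUDA code
  // used the template parameter <8>
}
{% end %}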

Just noticed that SpatialBatchNormalization used to be working, and was broken in torch/nn#665 and an earlier commit. Fixed it now. Should be working again now?

Hi, I decided to take the radical step of rolling back torch and cltorch to around 21 February, prior to a bunch of the THNN changes, which were causing entropy at a rate faster than I could handle. This has the downside that that's just slightly before your PR for SpatialUpSamplingNearest, so you might need to resubmit that PR please.

Should all be working again now, I think?