missing API against TH

soumith opened this issue 10 years ago

The following math functions are missing in THC but present in TH:

Once these are implemented, cwrap entries can be added to make cutorch completely API-compatible with torch.

Clement Farabet · Answer 1 · Sun Nov 09 2014 01:26:24 GMT+0800 (China Standard Time)

Hey @soumith, running cutorch.test() currently reports errors on cumsum and cumprod (missing). Is that on purpose?

Soumith Chintala · Answer 2 · Sun Nov 09 2014 01:27:42 GMT+0800 (China Standard Time)

Yes, they are yet to be implemented in cutorch. I'll get to that next week. They can be done with a Thrust scan (e.g. thrust::inclusive_scan).
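For readers unfamiliar with scans: cumsum and cumprod are both inclusive prefix scans that differ only in the binary operator. The sketch below is a CPU illustration in Python (not the cutorch code), using `itertools.accumulate` as a stand-in for a GPU scan primitive; the function names are chosen here for illustration.

```python
from itertools import accumulate
import operator


def cumsum(values):
    """Inclusive prefix sum: out[i] = values[0] + ... + values[i]."""
    return list(accumulate(values, operator.add))


def cumprod(values):
    """Inclusive prefix product: out[i] = values[0] * ... * values[i]."""
    return list(accumulate(values, operator.mul))


print(cumsum([1, 2, 3, 4]))   # [1, 3, 6, 10]
print(cumprod([1, 2, 3, 4]))  # [1, 2, 6, 24]
```

On a GPU the same result would come from a parallel scan over each slice along the chosen dimension; only the operator changes between the two functions.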

Clement Farabet · Answer 3 · Sun Nov 09 2014 02:13:41 GMT+0800 (China Standard Time)

Ok cool.

Dominik Grewe · Answer 4 · Sat Jan 03 2015 21:11:54 GMT+0800 (China Standard Time)

Does anyone have an implementation of std and var yet? Otherwise I'll have a go next week.
AFAICT, we can't use THCudaTensor_reduceDim, so we'll need something custom (but with a structure similar to reduceDim).
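The thread does not say why reduceDim is unsuitable. One plausible reason (an assumption on my part) is that variance carries more than one running value per output element, which a plain binary-operator reduction does not model. A minimal CPU sketch in Python of a one-pass variance per row, using Welford's update; the function names and the `unbiased` flag are illustrative only:

```python
import math


def var_along_rows(matrix, unbiased=True):
    """Variance of each row, computed in a single pass with Welford's update."""
    result = []
    for row in matrix:
        count, mean, m2 = 0, 0.0, 0.0
        for x in row:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)  # running sum of squared deviations
        denom = count - 1 if unbiased else count
        result.append(m2 / denom if denom > 0 else float("nan"))
    return result


def std_along_rows(matrix, unbiased=True):
    """Standard deviation of each row: square root of the variance."""
    return [math.sqrt(v) for v in var_along_rows(matrix, unbiased)]


print(var_along_rows([[1.0, 2.0, 3.0, 4.0]], unbiased=False))  # [1.25]
```

Each output element keeps three accumulators (count, mean, m2), which is the part a custom kernel would need to handle.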

Soumith Chintala · Answer 5 · Sun Jan 04 2015 02:45:14 GMT+0800 (China Standard Time)

No, we do not have it. This is long overdue, but we should coordinate these; let me email you guys.

Dominik Grewe · Answer 6 · Tue Jan 20 2015 21:46:26 GMT+0800 (China Standard Time)

Any progress on cumsum and cumprod?

Soumith Chintala · Answer 7 · Wed Jan 21 2015 04:24:07 GMT+0800 (China Standard Time)

Have not started them yet!

Dominik Grewe · Answer 8 · Wed Jan 21 2015 06:08:17 GMT+0800 (China Standard Time)

Okay, I might have a go soon then, because we'd like to use it. Just waiting for the THCState PR to be merged.

Soumith Chintala · Answer 9 · Thu Jan 22 2015 01:01:56 GMT+0800 (China Standard Time)

I will merge the THCState PR on Friday. All the patches have been prepared except for fbcunn, working on that as well.

Dominik Grewe · Answer 10 · Thu Feb 19 2015 19:03:28 GMT+0800 (China Standard Time)

@soumith For reductions along a single dimension, we still have the restriction that the tensor must not have more than 4 dimensions. Did you say you're working on a fix for that? If so, what's the progress on it?
If you're not working on it, we could easily change the code to what we do for std and var, where there's no restriction on dimensionality at all.
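A dimensionality-free reduction can be sketched by viewing the shape as (outer, n, inner), where n is the size of the reduced dimension. The Python below is a CPU illustration only and assumes contiguous row-major storage, which is an assumption and not something the thread states about the std/var kernels; `reduce_dim` is a hypothetical name.

```python
from functools import reduce
import operator


def reduce_dim(data, sizes, dim, op, init):
    """Reduce a contiguous row-major tensor along `dim`.

    `data` is the flat storage and `sizes` the shape. The shape is viewed as
    (outer, n, inner), so the number of dimensions does not matter.
    """
    outer = reduce(operator.mul, sizes[:dim], 1)
    n = sizes[dim]
    inner = reduce(operator.mul, sizes[dim + 1:], 1)
    out = []
    for o in range(outer):
        for i in range(inner):
            acc = init
            for k in range(n):
                acc = op(acc, data[(o * n + k) * inner + i])
            out.append(acc)
    return out


# A 5-dimensional tensor of shape (2, 1, 3, 1, 2), summed along dimension 2.
data = list(range(12))
print(reduce_dim(data, (2, 1, 3, 1, 2), 2, operator.add, 0))  # [6, 9, 24, 27]
```

In a kernel, each (o, i) pair would map to one thread or one block, with the inner loop over k being the actual reduction.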

Soumith Chintala · Answer 11 · Fri Feb 20 2015 00:17:11 GMT+0800 (China Standard Time)

@wickedfoo already has a PR for that internally. It's all implemented. I'm on vacation until the 26th, so I will sync those changes at the end of this month.

Soumith Chintala · Answer 12 · Fri Feb 20 2015 00:19:37 GMT+0800 (China Standard Time)

His PR is for apply (and apply2) along an arbitrary dimension, not reductions. It generalizes the copy kernels and changes all the tensor math to use these apply kernels where appropriate, instead of making the tensor contiguous and then calling Thrust.
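The idea of applying a pointwise operation without first making the tensor contiguous can be sketched as follows: each linear element index is converted to a storage offset using the sizes and strides. This is a CPU illustration in Python with hypothetical helper names, not the actual apply kernel.

```python
def strided_offsets(sizes, strides, storage_offset=0):
    """Yield the storage offset of every element, in row-major index order."""
    total = 1
    for s in sizes:
        total *= s
    for linear in range(total):
        offset = storage_offset
        remaining = linear
        # Peel off coordinates from the innermost dimension outwards.
        for size, stride in zip(reversed(sizes), reversed(strides)):
            offset += (remaining % size) * stride
            remaining //= size
        yield offset


def apply_inplace(storage, sizes, strides, fn, storage_offset=0):
    """Apply fn to each element of a strided view, writing back in place."""
    for off in strided_offsets(sizes, strides, storage_offset):
        storage[off] = fn(storage[off])


# Column 1 of a 2x3 row-major matrix: a non-contiguous view with stride 3.
storage = [0, 1, 2, 3, 4, 5]
apply_inplace(storage, (2,), (3,), lambda x: x * 10, storage_offset=1)
print(storage)  # [0, 10, 2, 3, 40, 5]
```

In a GPU version each thread would compute the offset for its own linear index, so no temporary contiguous copy is required.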

Soumith Chintala · Answer 13 · Fri Feb 20 2015 00:20:17 GMT+0800 (China Standard Time)

That's the status on that. He has not yet worked on arbitrary reductions. If you want to tackle that, go for it.

Dominik Grewe · Answer 14 · Fri Feb 20 2015 00:22:18 GMT+0800 (China Standard Time)

Thanks for the update. I'll have a go at the reductions kernel then.
Looking forward to the apply kernels!

Jeff Johnson · Answer 15 · Fri Feb 20 2015 02:17:32 GMT+0800 (China Standard Time)

I'm on vacation too until next week, but I don't think the generic apply stuff got pushed yet; I think only the old version of the copy kernel was. I reimplemented all cutorch math (pointwise operators) in terms of it, so no newContiguous calls are needed. I don't yet support reductions, but it shouldn't be too hard to add that in.

Dominik Grewe · Answer 16 · Tue Mar 03 2015 03:29:39 GMT+0800 (China Standard Time)

If I remember correctly, you guys said you'd look into maskedFill etc, right? Any progress on that?

Jeff Johnson · Answer 17 · Tue Mar 03 2015 06:50:49 GMT+0800 (China Standard Time)

For maskedFill etc., do you want the mask to be a float vector (because that's the only thing we have in cutorch at present), or do you want it to be 4 bytes packed into a float?

Dominik Grewe · Answer 18 · Tue Mar 03 2015 07:19:05 GMT+0800 (China Standard Time)

I guess float vectors make the most sense, because that's what logical functions (gt, ge etc) return.
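To illustrate the float-mask convention: a comparison produces 1.0 or 0.0 per element, and maskedFill treats any nonzero entry as "selected". The Python sketch below is a CPU illustration with hypothetical function names, and it returns a copy where the Torch operation works in place.

```python
def gt(values, threshold):
    """Elementwise comparison returning a float mask (1.0 or 0.0)."""
    return [1.0 if v > threshold else 0.0 for v in values]


def masked_fill(values, mask, fill_value):
    """Return a copy of values with fill_value wherever mask is nonzero."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same number of elements")
    return [fill_value if m != 0.0 else v for v, m in zip(values, mask)]


x = [1.0, 5.0, 2.0, 7.0]
print(masked_fill(x, gt(x, 3.0), 0.0))  # [1.0, 0.0, 2.0, 0.0]
```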

Jeff Johnson · Answer 19 · Thu Mar 05 2015 10:58:35 GMT+0800 (China Standard Time)

I have maskedFill/Copy/Select done. For sort(), I only support power-of-2 sizes at present (though on input with an arbitrary number of dimensions), so I'm still working on that. maskedFill, maskedCopy and sort avoid newContiguous on the input, but for maskedSelect I chickened out and just used two passes and temporary space with a Thrust prefix scan.
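A two-pass maskedSelect of this kind is usually a stream compaction: a prefix scan over the mask assigns each kept element its output slot, and a second pass scatters the elements. The Python sketch below is a CPU illustration of that general technique, assumed here rather than taken from the actual implementation.

```python
from itertools import accumulate


def masked_select(values, mask):
    """Select values where mask is nonzero, using scan-based compaction."""
    flags = [1 if m != 0 else 0 for m in mask]
    # Pass 1: an exclusive prefix sum gives each kept element its output slot.
    inclusive = list(accumulate(flags))
    positions = [inc - f for inc, f in zip(inclusive, flags)]
    out = [None] * (inclusive[-1] if inclusive else 0)
    # Pass 2: scatter kept elements into their slots (independent per element).
    for v, f, p in zip(values, flags, positions):
        if f:
            out[p] = v
    return out


print(masked_select([10, 20, 30, 40, 50], [0.0, 1.0, 1.0, 0.0, 1.0]))  # [20, 30, 50]
```

The `positions` list plays the role of the temporary space mentioned above; the output size is only known once the scan has finished.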

Re: "For reductions along a single dimension, we still have the restriction that the tensor must not have more than 4 dimensions. Did you say you're working on a fix for that? If so, what's the progress on it?" I have this fixed as well: I took the copy kernel code and made a reduction kernel out of it, so no calls to newContiguous, copies etc. are needed. It is not a global reduction kernel (like a norm, which reduces down to one point); it reduces along a dimension. sort() exploits similar code. I want to do the same shared memory optimization that you did (so I can use coalesced reads) if the reduction dimension is innermost/most contiguous, though.

Dominik Grewe · Answer 20 · Thu Mar 05 2015 17:37:05 GMT+0800 (China Standard Time)

Cool, looking forward to that. Yes, using the shared memory approach for reductions along contiguous dimensions is vital.

Dominik Grewe · Answer 21 · Thu Mar 05 2015 22:13:24 GMT+0800 (China Standard Time)

When do you think you'll have a PR for maskedFill etc?

Soumith Chintala · Answer 22 · Thu Mar 05 2015 22:15:36 GMT+0800 (China Standard Time)

It's in review, and Jeff is still working on revamping our code base for the state-argument-based change.
Hopefully sometime next week.

And the cutorch TensorMath changes that remove most of the sync points on non-contiguous cases will also land at the same time.

Max Jaderberg · Answer 23 · Thu Mar 12 2015 20:15:26 GMT+0800 (China Standard Time)

Any progress on the masked* functions?

Soumith Chintala · Answer 24 · Thu Mar 12 2015 22:12:22 GMT+0800 (China Standard Time)

They're implemented. We are working on refactoring our code and syncing with master; we will try to merge them this week.

Soumith Chintala · Answer 25 · Fri Mar 13 2015 22:24:49 GMT+0800 (China Standard Time)

An update:
Jeff has powered through our internal refactor, getting us back to parity with OSS cutorch.
There are three PRs coming up:

masked* functions
sort
all of the math (where applicable) revamped to use our own pointwise and reduce kernels, so that non-contiguous tensors are no longer sync points.

It will either be EOD today or most likely Monday/Tuesday.

Soumith Chintala · Answer 26 · Fri Mar 13 2015 22:30:57 GMT+0800 (China Standard Time)

Looks like what's left is the small fish.

torch.diag, torch.eye, torch.trace can be handled with the same generic diagonal-apply kernel.
torch.randperm, linspace, logspace, range is a thrust::scan
logicalall, logicalany also apply kernel!?
tril and triu might be tricky
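The diagonal-apply idea for diag, eye and trace rests on one observation: the main diagonal of a strided 2-D tensor is itself a 1-D strided view whose step is the sum of the two strides. A CPU sketch in Python, with hypothetical helper names and 0-based indexing:

```python
def diagonal_offsets(rows, cols, stride0, stride1, storage_offset=0):
    """Storage offsets of the main diagonal of a strided 2-D tensor."""
    step = stride0 + stride1  # one row down plus one column right
    return [storage_offset + i * step for i in range(min(rows, cols))]


def diag(storage, rows, cols, stride0, stride1):
    """Extract the main diagonal as a list."""
    return [storage[o] for o in diagonal_offsets(rows, cols, stride0, stride1)]


def trace(storage, rows, cols, stride0, stride1):
    """Sum of the main diagonal."""
    return sum(diag(storage, rows, cols, stride0, stride1))


def eye(n):
    """Row-major n x n identity matrix as flat storage."""
    storage = [0.0] * (n * n)
    for o in diagonal_offsets(n, n, n, 1):
        storage[o] = 1.0
    return storage


m = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # 3x3, row-major, strides (3, 1)
print(diag(m, 3, 3, 3, 1), trace(m, 3, 3, 3, 1))  # [1, 5, 9] 15
```

All three operations touch only the offsets returned by `diagonal_offsets`, which is why a single kernel could serve them.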

Dominik Grewe · Answer 27 · Fri Mar 13 2015 22:55:40 GMT+0800 (China Standard Time)

Thanks Soumith. Looking forward to that!

What about convolution functions like conv2, conv3 etc? There's code in THCTensorConv.cu, but it's not exposed in Lua. Any idea why? If we want full API parity between Torch and cutorch, we should add those, don't you think?

Soumith Chintala · Answer 28 · Fri Mar 13 2015 22:57:34 GMT+0800 (China Standard Time)

Ah yes, you are right. Not sure why I did not add them to the list. Writing conv2/conv3 kernels from scratch is not going to be worth our time. Maybe we can use the cu* API for that? Either that, or on GPU we use a buffer to unfold and do MM. What do you think?
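The "unfold and do MM" option can be sketched as follows: every output position gets a row holding its flattened input patch, and the result is then a product of that patch matrix with the flattened kernel. The Python below is a CPU illustration of "valid" cross-correlation (the xcorr2 flavour); a true convolution would flip the kernel first. The function names are hypothetical.

```python
def unfold(image, kh, kw):
    """im2col: one row per output position, holding the flattened patch."""
    h, w = len(image), len(image[0])
    rows = []
    for y in range(h - kh + 1):
        for x in range(w - kw + 1):
            rows.append([image[y + dy][x + dx]
                         for dy in range(kh) for dx in range(kw)])
    return rows


def xcorr2_valid(image, kernel):
    """'Valid' 2-D cross-correlation as unfold plus a matrix-vector product."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [k for row in kernel for k in row]
    patches = unfold(image, kh, kw)
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel))
                for patch in patches]
    out_w = len(image[0]) - kw + 1
    return [flat_out[i:i + out_w] for i in range(0, len(flat_out), out_w)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(xcorr2_valid(image, [[1, 0], [0, 1]]))  # [[6, 8], [12, 14]]
```

With several kernels the flattened kernels form a matrix, so the product becomes a single matrix-matrix multiply, at the cost of the extra unfold buffer.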

Dominik Grewe · Answer 29 · Fri Mar 13 2015 23:28:05 GMT+0800 (China Standard Time)

If we can use cuDNN for this, then that would be easiest I guess.

Soumith Chintala · Answer 30 · Sat Mar 14 2015 02:20:38 GMT+0800 (China Standard Time)

cuDNN is still not shipped with the CUDA toolkit, so not everyone has it. So it falls into the murky territory of whether we really want to introduce a hard dependency on cuDNN.

I am okay with a pcall to cudnn and raising an error when it is not found, but I am not sure how that will go down with the others.
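The soft-dependency pattern described here (try to load the library, fail with a clear error only when the feature is actually used) looks roughly like this. The sketch is in Python for illustration, while the thread is talking about a Lua pcall; the module name `hypothetical_cudnn_bindings` and its `conv2` function are invented for the example.

```python
import importlib


def load_optional_backend(module_name):
    """Return the module if it can be imported, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def conv2(image, kernel, backend_name="hypothetical_cudnn_bindings"):
    """Dispatch to the optional backend, or fail with a clear message."""
    backend = load_optional_backend(backend_name)
    if backend is None:
        raise RuntimeError(
            "conv2 requires the optional backend '%s', which was not found"
            % backend_name)
    return backend.conv2(image, kernel)
```

The rest of the package keeps working without the library; only callers of `conv2` see the error.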

Dominik Grewe · Answer 31 · Wed Apr 01 2015 23:18:13 GMT+0800 (China Standard Time)

There are a number of linear algebra functions missing: symeig, eig, inverse etc. In Torch they seem to be implemented by wrapping LAPACK. Could we do something similar for cutorch? There's MAGMA and CULA; does anyone have experience with these libraries?

Soumith Chintala · Answer 32 · Wed Apr 01 2015 23:22:27 GMT+0800 (China Standard Time)

MAGMA looks best, we built MAGMA internally and it looks reasonably good.

Soumith Chintala · Answer 33 · Wed Apr 01 2015 23:23:32 GMT+0800 (China Standard Time)

Also, on the cuDNN note, we can configure a header (like THGeneral.h.in) if we find cuDNN. Caffe has the cmake macros needed for finding cuDNN already written: https://github.com/BVLC/caffe/blob/master/cmake/Cuda.cmake

David Pfau · Answer 34 · Mon May 25 2015 06:47:32 GMT+0800 (China Standard Time)

Hi guys. Noticed this thread when dealing with a script that needs diag, svd and eig on CudaTensors. I implemented diag myself in Lua using storage() and set(), but svd and eig are beyond my ken. What's the plan for that?

Soumith Chintala · Answer 35 · Mon May 25 2015 08:56:30 GMT+0800 (China Standard Time)

One of my colleagues @SamGross is working on it by interfacing the magma cuda library. It'll happen over the next month or so when he finishes it up and sends a PR.

Jeff Johnson · Answer 36 · Thu May 28 2015 05:19:12 GMT+0800 (China Standard Time)

This is not on this list, but I'm in the process of implementing THCudaTensor_multinomial as well.

Nicholas Léonard · Answer 37 · Thu May 28 2015 23:16:01 GMT+0800 (China Standard Time)

@wickedfoo Awesome. Would love to see multinomial in cuda.

Hugh Perkins · Answer 38 · Mon Jun 15 2015 20:02:56 GMT+0800 (China Standard Time)

Just to confirm, scatter/gather aren't implemented in cutorch, right?

Dominik Grewe · Answer 39 · Mon Jun 15 2015 20:42:54 GMT+0800 (China Standard Time)

That's right. I meant to do it, but haven't had the time yet, sorry.

Hugh Perkins · Answer 40 · Tue Jun 23 2015 08:52:50 GMT+0800 (China Standard Time)

For gather, which I suddenly realize could be useful for implementing ClassNLLCriterion.forward without needing a custom kernel (I guess?), I suppose a simple naive first cut could be:

use the isContiguous (toContiguous? asContiguous?) method to convert the tensor to contiguous format
simply assign one thread to each output location, and I don't think we need any local memory or anything, right? So, just throw everything into warp-size blocks and it's basically done?

Does that sound about right? Anything else I should bear in mind if I write a naive gather along these lines? (I'll be targeting cltorch I confess, but it's quite easy to convert cutorch<->cltorch kernels I think?)

(Edit: what do you think is the most similar existing class/kernel to base this off? and/or thoughts on where to put this, ie filename(s)?)
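The naive gather outlined above, with one independent unit of work per output location, can be sketched on the CPU as below. This is Python for a 2-D tensor with 0-based indices (Torch indices are 1-based), not the cltorch kernel; each iteration of the inner loop corresponds to what one thread would do.

```python
def gather(src, dim, index):
    """2-D gather.

    dim == 0: out[i][j] = src[index[i][j]][j]
    dim == 1: out[i][j] = src[i][index[i][j]]
    """
    if dim not in (0, 1):
        raise ValueError("dim must be 0 or 1 for a 2-D tensor")
    out = [[None] * len(row) for row in index]
    for i, row in enumerate(index):
        for j, idx in enumerate(row):
            # Each (i, j) is independent, so it maps to one thread.
            out[i][j] = src[idx][j] if dim == 0 else src[i][idx]
    return out


# Picking the log-probability of the target class for each sample.
log_probs = [[-0.1, -2.3], [-1.2, -0.4]]
targets = [[0], [1]]
print(gather(log_probs, 1, targets))  # [[-0.1], [-0.4]]
```

The last lines show the ClassNLLCriterion-style use: the gathered values are the per-sample log-probabilities of the target classes.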

Hugh Perkins · Answer 41 · Tue Jun 23 2015 18:11:49 GMT+0800 (China Standard Time)

Re: "One of my colleagues @SamGross is working on it by interfacing the magma cuda library. It'll happen over the next month or so when he finishes it up and sends a PR."

Magma looks cool. Has opencl version too it seems :-)

Hugh Perkins · Answer 42 · Wed Jun 24 2015 00:42:23 GMT+0800 (China Standard Time)

Here is a gather implementation. It's not well tested yet.

The kernel: https://github.com/hughperkins/cltorch/blob/master/lib/THCl/THClGather.cl
The driver: https://github.com/hughperkins/cltorch/blob/master/lib/THCl/THClGather.cpp
The lua wrapper: https://github.com/hughperkins/cltorch/blob/master/torch/generic/Tensor.c#L539

Hugh Perkins · Answer 43 · Wed Jun 24 2015 08:24:21 GMT+0800 (China Standard Time)

Shoe-horned the lua wrapper into TensorMath.lua: hughperkins/cltorch@0e469f4

Hugh Perkins · Answer 44 · Wed Jun 24 2015 18:50:06 GMT+0800 (China Standard Time)

Did scatter too, since it seems like more of the same.

(Edit: and scatterFill, in same files)
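Scatter is the inverse addressing pattern of gather: the index selects where each source element is written instead of where it is read from. A CPU sketch in Python for the 2-D case along dimension 1, with 0-based indices and hypothetical function names; it is not the cltorch code.

```python
def scatter(out, index, src):
    """In place, along dimension 1: out[i][index[i][j]] = src[i][j]."""
    for i, row in enumerate(index):
        for j, idx in enumerate(row):
            out[i][idx] = src[i][j]
    return out


def scatter_fill(out, index, value):
    """In place, along dimension 1: out[i][index[i][j]] = value."""
    for i, row in enumerate(index):
        for idx in row:
            out[i][idx] = value
    return out


out = [[0, 0, 0], [0, 0, 0]]
print(scatter(out, [[2, 0], [1, 2]], [[5, 6], [7, 8]]))  # [[6, 0, 5], [0, 7, 8]]
```

If an index repeats within a row, the sequential loop keeps the last write; a parallel version would not guarantee which write wins.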

Arunkumar Byravan · Answer 45 · Tue Feb 23 2016 08:01:02 GMT+0800 (China Standard Time)

Is there any update on adding these functions? I'm in need of the cross product on CudaTensors.

I can code it up if someone can point me towards the things that need to be done. All the layers I've written so far are on the Lua side, and I'm not sure how to make the connection between CUDA and Lua. Thanks!

Soumith Chintala · Answer 46 · Sun Feb 28 2016 09:09:01 GMT+0800 (China Standard Time)

@abyravan I could get to cross next week.

Arunkumar Byravan · Answer 47 · Sun Feb 28 2016 09:17:12 GMT+0800 (China Standard Time)

That would be great. Thanks a lot! Is there any sort of a tutorial or howto on adding new functionality for tensors? Would be useful to have :)

Soumith Chintala · Answer 48 · Sun Feb 28 2016 09:19:00 GMT+0800 (China Standard Time)

You can look at existing PRs that are cross-linked in this thread.
Like:
#120
#96
#75

Soumith Chintala · Answer 49 · Sun Feb 28 2016 09:19:45 GMT+0800 (China Standard Time)

nonzero is being implemented by FB, should be out in a few days.

pvtokmakov · Answer 50 · Tue Jun 27 2017 00:35:34 GMT+0800 (China Standard Time)

Hi guys,

Any hope of implementing conv2 (or xcorr2, for that matter)?

Yicheng Luo · Answer 51 · Wed Aug 16 2017 05:25:25 GMT+0800 (China Standard Time)

Just adding a note that eye has been implemented here: pytorch/pytorch#2395

Sambhav Jain · Answer 52 · Sat Dec 23 2017 00:56:58 GMT+0800 (China Standard Time)

Thanks for supporting many of the math functions in THC. I'm now waiting for histc!