scaling
wassname opened this issue
Thanks for making and sharing this.
I'm evaluating kdnet for use with dense point clouds, which means looking at speed and memory footprint. Here are my results:
depth | input points | model params (K) | RAM usage (MB) | GPU RAM usage (MB) | forward time (ms) | kd-tree build time (ms) |
---|---|---|---|---|---|---|
11 | 2,048 | 3,705 | 1,986 | 468 | 3.6 | 268 |
13 | 8,192 | 3,715 | 2,104 | 413 | 3.7 | 1,000 |
14 | 16,384 | 3,727 | 1,956 | 501 | 3.8 | 2,000 |
16 | 65,536 | 3,801 | 2,176 | 621 | 8.0 | 8,000 |
17 | 131,072 | 3,899 | 2,221 | 815 | 11.4 | 17,000 |
19 | 524,288 | 4,500 | 2,574 | 1,723 | 100.0 | 60,000 |
So it looks like it scales pretty well, but the split_ps function, which is called to build the kd-tree, is the speed bottleneck (last column). I can't see a whole lot of room for improvement, since it's all fairly optimal PyTorch code.
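For context, here's the gist of the per-level median split as far as I can tell (a sketch with made-up names, not the actual split_ps code). The tensor ops per node are cheap, but the Python loop visits roughly 2^depth nodes, which fits the build times above:

```python
import torch

def median_split(points, axis):
    # Order the subset along the split axis and cut at the median.
    # Hypothetical stand-in for split_ps, not the real implementation.
    order = torch.argsort(points[:, axis])
    mid = points.size(0) // 2
    return order[:mid], order[mid:]

def build_kdtree(points, depth):
    # One Python iteration per level, one median_split per node:
    # roughly 2**depth splits in total, which is where the time goes.
    nodes = [torch.arange(points.size(0))]
    for level in range(depth):
        axis = level % points.size(1)  # cycle through x, y, z
        children = []
        for idx in nodes:
            left, right = median_split(points.index_select(0, idx), axis)
            children.extend([idx[left], idx[right]])
        nodes = children
    return nodes  # index sets for the leaves
```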
Do you think there's much opportunity for split_ps to be optimized?
@wassname Thanks a lot for benchmarking this. I was aware of the performance bottleneck when writing this piece of code. The best way to fix it would be to write a cffi binding and do the indexing in C++/CUDA. However, I don't think I will have time to fix this now. Pull requests are welcome :)
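For anyone who wants to attempt this, a rough sketch of the cffi route might look like the following (module and function names are made up, and a real binding would partition around the median in C/CUDA rather than fully sorting):

```python
# build_split_ps_c.py -- compiles a tiny C extension with cffi.
from cffi import FFI

ffibuilder = FFI()
ffibuilder.cdef("void argsort_axis(const float* keys, int n, long long* idx);")
ffibuilder.set_source("_split_ps_c", r"""
#include <stdlib.h>

/* qsort has no context pointer, so stash the keys in a global. */
static const float* g_keys;

static int cmp(const void* a, const void* b) {
    long long i = *(const long long*)a, j = *(const long long*)b;
    return (g_keys[i] > g_keys[j]) - (g_keys[i] < g_keys[j]);
}

/* Fill idx with the permutation that sorts keys ascending. */
void argsort_axis(const float* keys, int n, long long* idx) {
    for (long long k = 0; k < n; k++) idx[k] = k;
    g_keys = keys;
    qsort(idx, (size_t)n, sizeof(long long), cmp);
}
""")

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)

# After compiling, the C function can be called on raw tensor memory:
#   from _split_ps_c import ffi, lib
#   keys = points[:, axis].contiguous()            # float32, on CPU
#   idx = torch.empty(keys.numel(), dtype=torch.int64)
#   lib.argsort_axis(ffi.cast("const float*", keys.data_ptr()),
#                    keys.numel(),
#                    ffi.cast("long long*", idx.data_ptr()))
```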
That's understandable. Cheers for the guidance on optimizing it, I might give that a try!
Looks promising!
I added a PR for this feature, and it improved the speed of kd-tree construction a lot. Even though it's much faster, I still found myself precomputing and caching kd-trees when training on large point clouds.
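Roughly what that caching looks like (a sketch only; the hashing scheme and helper names are made up, not from the PR):

```python
import hashlib
import os
import torch

def cached_kdtree(points, depth, build_fn, cache_dir="kdtree_cache"):
    # `build_fn` stands in for whatever builder you use (e.g. the
    # split_ps path in kdnet); keying on the raw point data plus the
    # tree depth is just one possible choice of cache key.
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha1(points.cpu().numpy().tobytes()).hexdigest()
    path = os.path.join(cache_dir, "{}_d{}.pt".format(key, depth))
    if os.path.exists(path):
        return torch.load(path)
    tree = build_fn(points, depth)
    torch.save(tree, path)
    return tree
```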
Hi @fxia22, I have a general question about indexing performance. It seems like torch.index_select copies the selected rows of the input tensor into new memory, so if your number of nearest neighbors is 10, you're using 10x the memory for the selected points. Is there any way to share the memory of the original tensor when calling index_select? A quick check like the one below (illustrative only) is why I think a copy is happening:
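```python
import torch

t = torch.randn(1000, 3)
idx = torch.arange(10)

sel = t.index_select(0, idx)
# index_select materialises a new tensor, so selecting k neighbors
# copies k rows rather than aliasing the original storage.
print(sel.data_ptr() == t.data_ptr())   # False: new allocation
# Only regular access patterns can be views, e.g. a contiguous slice:
view = t.narrow(0, 0, 10)
print(view.data_ptr() == t.data_ptr())  # True: shares storage
```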
Thank you.