fxia22 / kdnet.pytorch

A PyTorch implementation of "Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models".

Home Page: https://arxiv.org/abs/1704.01222


scaling

wassname opened this issue · comments

Thanks for making and sharing this.

I'm evaluating kdnet for use with dense point clouds, which means looking at its speed and memory footprint. Here are my results:

| depth | input points | model params (K) | RAM usage (MB) | GPU RAM usage (MB) | forward time (ms) | kd-tree time (ms) |
|------:|-------------:|-----------------:|---------------:|-------------------:|------------------:|------------------:|
| 11 | 2,048 | 3,705 | 1,986 | 468 | 3.6 | 268 |
| 13 | 8,192 | 3,715 | 2,104 | 413 | 3.7 | 1,000 |
| 14 | 16,384 | 3,727 | 1,956 | 501 | 3.8 | 2,000 |
| 16 | 65,536 | 3,801 | 2,176 | 621 | 8.0 | 8,000 |
| 17 | 131,072 | 3,899 | 2,221 | 815 | 11.4 | 17,000 |
| 19 | 524,288 | 4,500 | 2,574 | 1,723 | 100.0 | 60,000 |

So it looks like it scales pretty well, but the split_ps function, which is called to build the kd-tree, is the speed bottleneck (last column). I can't see a whole lot of room for improvement since it's all fairly optimal PyTorch code.

Do you think there's much opportunity for split_ps to be optimized?
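For context, the expensive part is doing the per-level median partitioning in Python/PyTorch. The repo's actual split_ps has a different signature and batching, but a simplified sketch of the kind of recursive median split a kd-tree builder performs looks like this:

```python
import torch

def split_points(points, depth=0, max_depth=3):
    """Recursively split a point set at the median of its widest axis.

    A simplified illustration only -- the real split_ps in this repo
    works on batched index tensors rather than raw point lists.
    """
    if depth == max_depth:
        return [points]
    # choose the axis with the largest spread
    spread = points.max(dim=0).values - points.min(dim=0).values
    axis = int(torch.argmax(spread))
    # sort along that axis and split at the median index
    order = torch.argsort(points[:, axis])
    mid = points.shape[0] // 2
    left, right = points[order[:mid]], points[order[mid:]]
    return (split_points(left, depth + 1, max_depth)
            + split_points(right, depth + 1, max_depth))

leaves = split_points(torch.rand(64, 3), max_depth=3)
# 2**3 = 8 leaves of 8 points each
```

Each level does a full sort per node from Python, which is where the time goes as depth grows.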

@wassname Thanks a lot for benchmarking this. I am aware of the performance bottleneck when writing this piece of code. The best way to fix this would be to write a cffi binding and do the indexing in C++/CUDA. However I don't think I will have time to fix this now. Pull requests are welcome :)

That's understandable. Cheers for the guidance on optimizing it, I might give that a try!

FYI: I haven't got it to a PR stage, but it looks like you can use scipy's cKDTree to get much better speed and scaling. Here's the branch I was working on. Here's a test notebook.
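To illustrate the idea (not the exact code in the branch, which also has to map scipy's tree structure back onto kdnet's split indices): cKDTree builds the tree in C, and `balanced_tree=True` splits at the median, which is the balanced structure a kd-network consumes.

```python
import numpy as np
from scipy.spatial import cKDTree

# a random cloud the size of the depth-14 row above: 2**14 points
points = np.random.rand(2 ** 14, 3)

# tree construction happens in C, so this is fast even for
# hundreds of thousands of points
tree = cKDTree(points, balanced_tree=True)

# sanity check: each point's nearest neighbour is itself
dists, idxs = tree.query(points[:4], k=1)
print(dists)  # ~[0. 0. 0. 0.]
```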

Looks promising!

I added a PR for this feature; it improved the speed of kd-tree construction a lot. Even though it's much faster, I still found myself precomputing and caching kd-trees when training on large point clouds.

Hi @fxia22, I have a general question about indexing performance. It seems that torch.index_select copies the selected rows of the input tensor into new memory. So if your number of nearest neighbors is 10, you're using 10x the memory. Is there any way to share the memory of the original tensor when calling index_select?

Thank you.
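For what it's worth, index_select (like gather and advanced indexing) always materializes a copy: a gather with arbitrary, possibly repeated indices can't be expressed as a strided view of the original storage, whereas a contiguous slice can. A quick check:

```python
import torch

x = torch.arange(12, dtype=torch.float32).reshape(4, 3)
idx = torch.tensor([0, 0, 2])  # repeated index: no strided view can express this

out = torch.index_select(x, 0, idx)

# index_select allocates a new tensor; it never aliases x's storage
print(out.data_ptr() == x.data_ptr())   # False
# a contiguous slice, by contrast, is a view sharing x's storage
view = x[:2]
print(view.data_ptr() == x.data_ptr())  # True
```

So the k-times memory cost of gathering k neighbors is inherent to the operation, though the copy can be kept small by gathering only the features you need.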