scaling
wassname opened this issue
Thanks for making and sharing this.
I'm evaluating kdnet for use with dense point clouds, which means looking at speed and memory footprint. Here are my results:
depth | input points | model params (K) | RAM usage (MB) | GPU RAM usage (MB) | forward time (ms) | kd-tree build time (ms) |
---|---|---|---|---|---|---|
11 | 2,048 | 3,705 | 1,986 | 468 | 3.6 | 268 |
13 | 8,192 | 3,715 | 2,104 | 413 | 3.7 | 1,000 |
14 | 16,384 | 3,727 | 1,956 | 501 | 3.8 | 2,000 |
16 | 65,536 | 3,801 | 2,176 | 621 | 8.0 | 8,000 |
17 | 131,072 | 3,899 | 2,221 | 815 | 11.4 | 17,000 |
19 | 524,288 | 4,500 | 2,574 | 1,723 | 100.0 | 60,000 |
So it looks like it scales pretty well, but the split_ps function, which is called to build the kd-tree, is the speed bottleneck (last column). I can't see a whole lot of room for improvement, since it's all fairly optimal PyTorch code.
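For context, here's the gist of the per-level median split as far as I can tell (a sketch with made-up names, not the actual split_ps code). The tensor ops per node are cheap, but the Python loop visits roughly 2^depth nodes, which fits the build times above:

```python
import torch

def median_split(points, axis):
    # Order the subset along the split axis and cut at the median.
    # Hypothetical stand-in for split_ps, not the real implementation.
    order = torch.argsort(points[:, axis])
    mid = points.size(0) // 2
    return order[:mid], order[mid:]

def build_kdtree(points, depth):
    # One Python iteration per level, one median_split per node:
    # roughly 2**depth splits in total, which is where the time goes.
    nodes = [torch.arange(points.size(0))]
    for level in range(depth):
        axis = level % points.size(1)  # cycle through x, y, z
        children = []
        for idx in nodes:
            left, right = median_split(points.index_select(0, idx), axis)
            children.extend([idx[left], idx[right]])
        nodes = children
    return nodes  # index sets for the leaves
```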
Do you think there's much opportunity for split_ps to be optimized?
@wassname Thanks a lot for benchmarking this. I was aware of the performance bottleneck when writing this piece of code. The best way to fix it would be to write a cffi binding and do the indexing in C++/CUDA. However, I don't think I will have time to fix this now. Pull requests are welcome :)
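For anyone who wants to attempt this, a rough sketch of the cffi route might look like the following (module and function names are made up, and a real binding would partition around the median in C/CUDA rather than fully sorting):

```python
# build_split_ps_c.py -- compiles a tiny C extension with cffi.
from cffi import FFI

ffibuilder = FFI()
ffibuilder.cdef("void argsort_axis(const float* keys, int n, long long* idx);")
ffibuilder.set_source("_split_ps_c", r"""
#include <stdlib.h>

/* qsort has no context pointer, so stash the keys in a global. */
static const float* g_keys;

static int cmp(const void* a, const void* b) {
    long long i = *(const long long*)a, j = *(const long long*)b;
    return (g_keys[i] > g_keys[j]) - (g_keys[i] < g_keys[j]);
}

/* Fill idx with the permutation that sorts keys ascending. */
void argsort_axis(const float* keys, int n, long long* idx) {
    for (long long k = 0; k < n; k++) idx[k] = k;
    g_keys = keys;
    qsort(idx, (size_t)n, sizeof(long long), cmp);
}
""")

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)

# After compiling, the C function can be called on raw tensor memory:
#   from _split_ps_c import ffi, lib
#   keys = points[:, axis].contiguous()            # float32, on CPU
#   idx = torch.empty(keys.numel(), dtype=torch.int64)
#   lib.argsort_axis(ffi.cast("const float*", keys.data_ptr()),
#                    keys.numel(),
#                    ffi.cast("long long*", idx.data_ptr()))
```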
That's understandable. Cheers for the guidance on optimizing it, I might give that a try!
Looks promising!
I added a PR for this feature, and it improved the speed of kd-tree construction a lot. Even though it's much faster, I still found myself precomputing and caching kd-trees when training on large point clouds.
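Roughly what that caching looks like (a sketch only; the hashing scheme and helper names are made up, not from the PR):

```python
import hashlib
import os
import torch

def cached_kdtree(points, depth, build_fn, cache_dir="kdtree_cache"):
    # `build_fn` stands in for whatever builder you use (e.g. the
    # split_ps path in kdnet); keying on the raw point data plus the
    # tree depth is just one possible choice of cache key.
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha1(points.cpu().numpy().tobytes()).hexdigest()
    path = os.path.join(cache_dir, "{}_d{}.pt".format(key, depth))
    if os.path.exists(path):
        return torch.load(path)
    tree = build_fn(points, depth)
    torch.save(tree, path)
    return tree
```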
Hi @fxia22, I have a general question about indexing performance. It seems like torch.index_select copies the selected rows of the input tensor into new memory, so if your number of nearest neighbors is 10, you're using 10x the memory for the selected points. Is there any way to share the memory of the original tensor when calling index_select? A quick check like the one below (illustrative only) is why I think a copy is happening:
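```python
import torch

t = torch.randn(1000, 3)
idx = torch.arange(10)

sel = t.index_select(0, idx)
# index_select materialises a new tensor, so selecting k neighbors
# copies k rows rather than aliasing the original storage.
print(sel.data_ptr() == t.data_ptr())   # False: new allocation
# Only regular access patterns can be views, e.g. a contiguous slice:
view = t.narrow(0, 0, 10)
print(view.data_ptr() == t.data_ptr())  # True: shares storage
```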
Thank you.