Use Trees for data structures

Question

maxcw opened this issue 4 years ago

It looks like the distances between all pairs of points are currently being calculated, which is expensive. Borrowing from sklearn, a BallTree or KDTree could be used to speed up the nearest neighbor calculations.

Valentino Constantinou · Answer 1 · Sat May 02 2020 07:45:04 GMT+0800 (China Standard Time)

Thanks @maxcw! There is a parallel effort to integrate Local Outlier Probabilities (LoOP) into scikit-learn; see this pull request.

I was working on that PR some time ago but haven't updated it, for lack of time and of interest from others. PyNomaly aims to be a standalone library that minimizes the number of required dependencies. There are ways to speed up the current nearest neighbor code (like the parallelism issue you opened), but integrating scikit-learn capability into PyNomaly is not currently in scope. It may be a better idea to update the PR with refreshed code that enables LoOP in scikit-learn, which would bring the fast nearest neighbor calculations with it. If you want to contribute to that PR, I would be willing to jump back in and do it together. I don't think adding a dependency on scikit-learn is the right approach. However, I am open to other suggestions on how to speed up the nearest neighbor calculations using trees as the data structure.
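
As a rough illustration of what a tree-based neighbor search without a scikit-learn dependency could look like, here is a minimal KD-tree sketch using only the Python standard library. It is not PyNomaly code: the names `build_kdtree`, `query_knn`, and `knn_matrices` are hypothetical, the metric is assumed to be Euclidean, and each point is assumed to be excluded from its own neighbor list.

```python
import heapq
import math


def build_kdtree(points, indices=None, depth=0):
    """Recursively build a KD-tree as nested tuples (index, axis, left, right)."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    indices.sort(key=lambda i: points[i][axis])
    mid = len(indices) // 2  # the median point becomes the splitting node
    return (
        indices[mid],
        axis,
        build_kdtree(points, indices[:mid], depth + 1),
        build_kdtree(points, indices[mid + 1:], depth + 1),
    )


def query_knn(node, points, target, k, skip=None, heap=None):
    """Collect the k nearest neighbors of `target` in a max-heap of (-dist, index)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    index, axis, left, right = node
    if index != skip:  # optionally leave out the query point itself
        dist = math.dist(points[index], target)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, index))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, index))
    diff = target[axis] - points[index][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    query_knn(near, points, target, k, skip, heap)
    # Visit the far side only if the splitting plane is closer than the
    # current k-th neighbor (or fewer than k neighbors have been found).
    if len(heap) < k or abs(diff) < -heap[0][0]:
        query_knn(far, points, target, k, skip, heap)
    return heap


def knn_matrices(points, k):
    """Return (distances, neighbors) as n-by-k nested lists, nearest first."""
    tree = build_kdtree(points)
    distances, neighbors = [], []
    for i, p in enumerate(points):
        found = sorted((-neg, j) for neg, j in query_knn(tree, points, p, k, skip=i))
        distances.append([d for d, _ in found])
        neighbors.append([j for _, j in found])
    return distances, neighbors
```

The pruning step is what avoids computing every pairwise distance. Whether a pure-Python tree actually beats the existing implementation would need benchmarking, particularly on small or high-dimensional data, where KD-trees tend to lose their advantage.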

Not sure if you are aware of this or if it helps your work, but when using PyNomaly you can always bring your own distance and neighbor matrices, as shown in the examples in the readme. That means using external libraries to calculate the distance and neighbor matrices is an option: simply provide them to PyNomaly afterwards.
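
The bring-your-own-matrices route might look roughly like the sketch below, which builds both matrices with scikit-learn's `KDTree` and hands them to PyNomaly. The helper names `knn_matrices` and `score_with_pynomaly` are hypothetical. The PyNomaly call follows the pattern recalled from the readme's distance-matrix example (`distance_matrix`, `neighbor_matrix`, `n_neighbors`) and should be checked against the installed version. Dropping each point's zero-distance match with itself is also an assumption, so it is worth confirming whether the readme keeps it.

```python
import numpy as np
from sklearn.neighbors import KDTree


def knn_matrices(data, k):
    """Return (distances, indices), each of shape (n_samples, k)."""
    tree = KDTree(data)
    # Query k + 1 neighbors: each point is its own nearest neighbor.
    dist, idx = tree.query(data, k=k + 1)
    # Assumption: drop the first column, the zero-distance self match
    # (with exact duplicates this may drop the duplicate instead).
    return dist[:, 1:], idx[:, 1:]


def score_with_pynomaly(dist, idx, k):
    """Assumed PyNomaly usage, modeled on the readme's distance-matrix example."""
    from PyNomaly import loop  # imported here so the helper above works without it

    model = loop.LocalOutlierProbability(
        distance_matrix=dist, neighbor_matrix=idx, n_neighbors=k
    ).fit()
    return model.local_outlier_probabilities


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 3))
    dist, idx = knn_matrices(data, k=10)
    print(dist.shape, idx.shape)
```

This keeps scikit-learn as an optional dependency on the user's side rather than a requirement of PyNomaly itself.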