vc1492a / PyNomaly

Anomaly detection using LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].

Distance Matrix support

TSFelg opened this issue

I'm currently using LOF with a precomputed distance matrix. Is it possible to use a distance matrix with LoOP as well, or are the original points needed to compute the probabilities?

@TFelgueira thanks for opening this issue! In the current PyNomaly implementation, it is not possible to supply a distance matrix in place of the data points it was computed from. This could, however, be introduced as a new feature in the current Numpy implementation. I've also been planning to transition PyNomaly to a scikit-learn code base in the future, and I believe that implementation would more readily support the use of a distance matrix. Would you be able to share a little more information about your use case? This will help me determine an appropriate time to introduce this capability into PyNomaly. Thanks!

Of course, thank you for your interest!

I have many histograms that I want to cluster and run outlier detection on. Histograms require specific statistical distances such as Chi2 and Earth Mover's Distance, which are not available in most clustering/outlier detection tools. That is why I've been calculating the distance matrix myself and then using HDBSCAN and LOF, for example, to cluster the histograms and/or find outliers.

The fact that LoOP adds a probabilistic view to LOF is a big advantage for my use case, so it would be a great help to have it accept distance matrices :)
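
The workflow described above (computing a histogram distance by hand before handing it to a clustering or outlier-detection tool) might look like the following minimal sketch. It assumes one common definition of the chi-squared distance, 0.5 * sum((p - q)^2 / (p + q)), applied to histograms normalized to sum to 1. The function name `chi2_distance_matrix`, the `eps` guard, and the toy histograms are illustrative assumptions, not part of PyNomaly or of the poster's actual code.

```python
import numpy as np


def chi2_distance_matrix(hists, eps=1e-12):
    """Pairwise chi-squared distances between the rows of `hists`.

    `hists` has shape (n_histograms, n_bins). Hypothetical helper.
    """
    h = np.asarray(hists, dtype=float)
    h = h / h.sum(axis=1, keepdims=True)      # normalize each histogram to sum to 1
    diff = h[:, None, :] - h[None, :, :]      # shape (n, n, n_bins)
    total = h[:, None, :] + h[None, :, :]
    # eps avoids 0/0 for bins that are empty in both histograms
    return 0.5 * np.sum(diff ** 2 / (total + eps), axis=2)


# Toy data (made up for illustration): four histograms with three bins each
hists = np.array([[10, 20, 30],
                  [12, 18, 30],
                  [30, 20, 10],
                  [0, 0, 60]])
dist = chi2_distance_matrix(hists)
print(dist.round(3))
```

The result is a full square (n, n) matrix, which is the form that tools accepting a precomputed metric usually expect.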

@TFelgueira thanks for the information. I've decided to include this feature in the next release, 0.2.6. Until then, you can check out this commit on the dev branch. It includes an implementation that lets you provide a distance matrix and a neighbor index matrix (i.e. the unique IDs of each point's closest neighbors) when calculating the local outlier probability. A few things to note:

  • Your distance and neighbor index matrices must have the same shape (e.g. (150, 10) if you're using the iris example with 10 neighbors).
  • The column dimension of your distance and neighbor index matrices must match the specified number of neighbors.
  • Provide either a set of data points or a distance matrix, but not both. At least one must be provided.
  • When using a distance matrix, a neighbor index matrix must also be provided.
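
The shape requirements in the list above can be illustrated with a small sketch that reduces a full square distance matrix to the two (n_samples, n_neighbors) matrices. The helper name `knn_from_distance_matrix` is hypothetical, the random data only stands in for the 150-row iris example, and excluding each point from its own neighbor list is an assumption rather than something this thread specifies.

```python
import numpy as np


def knn_from_distance_matrix(full, n_neighbors):
    """Return (distances, indices), each of shape (n_samples, n_neighbors).

    `full` is a square (n_samples, n_samples) distance matrix.
    Hypothetical helper, not part of PyNomaly.
    """
    full = np.asarray(full, dtype=float)
    n = full.shape[0]
    if full.ndim != 2 or full.shape != (n, n):
        raise ValueError("expected a square (n, n) distance matrix")
    if not 0 < n_neighbors < n:
        raise ValueError("n_neighbors must be between 1 and n - 1")
    masked = full.copy()
    np.fill_diagonal(masked, np.inf)                    # assumption: a point is not its own neighbor
    idx = np.argsort(masked, axis=1)[:, :n_neighbors]   # IDs of the closest neighbors
    d = np.take_along_axis(masked, idx, axis=1)         # distances to those neighbors
    return d, idx


# Stand-in data: 150 random points with 4 features (same shape as iris)
rng = np.random.default_rng(0)
points = rng.normal(size=(150, 4))
full = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

d, idx = knn_from_distance_matrix(full, n_neighbors=10)
print(d.shape, idx.shape)
```

Both matrices come out with 150 rows and 10 columns, so the column dimension matches the number of neighbors requested.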

Providing a distance matrix is not yet implemented for the stream functionality, and I haven't written any unit tests for the new functionality yet. I hope to get to that soon and test it more thoroughly before merging with master. In the meantime, check out iris_dist_grid.py in the examples to see how to use your own distance matrix with PyNomaly.

@TFelgueira I have merged dev with master and released this feature as part of the 0.2.6 release. With the most recent version, you can now specify a distance matrix and a neighbor matrix and use those matrices to calculate the LoOP.
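
For readers landing here, a minimal sketch of the released feature follows. It assumes the `distance_matrix`, `neighbor_matrix`, and `n_neighbors` keyword arguments and the `local_outlier_probabilities` attribute of PyNomaly 0.2.6 or later. The random data and the L1 distance are placeholders for your own data and metric, and leaving each point out of its own neighbor list is an assumption. The import is guarded only so that the sketch still runs where PyNomaly is not installed.

```python
import numpy as np

try:
    from PyNomaly import loop   # assumes PyNomaly >= 0.2.6 is installed
except ImportError:
    loop = None                 # lets the sketch run without PyNomaly

n_neighbors = 5
rng = np.random.default_rng(42)
data = rng.random((60, 8))      # made-up data: 60 "histograms" with 8 bins

# Any custom metric can go here; L1 (Manhattan) distance is a stand-in
full = np.abs(data[:, None, :] - data[None, :, :]).sum(axis=2)
np.fill_diagonal(full, np.inf)                      # assumption: skip self
idx = np.argsort(full, axis=1)[:, :n_neighbors]     # neighbor index matrix
d = np.take_along_axis(full, idx, axis=1)           # matching distance matrix

scores = None
if loop is not None:
    m = loop.LocalOutlierProbability(
        distance_matrix=d,
        neighbor_matrix=idx,
        n_neighbors=n_neighbors,
    ).fit()
    scores = m.local_outlier_probabilities          # one probability per row
```

Note that only the distance and neighbor index matrices are passed, not the data points themselves, in line with the rule that only one of the two may be provided.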