vc1492a / PyNomaly

Anomaly detection using LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].

ZeroDivisionError when using DataFrame

manuelgodoy opened this issue · comments

I had been using version 0.1.5 without issue in the following script, but after upgrading I now see the error below:

import pandas as pd
from PyNomaly import loop

data = [43.3, 62.9, 55.2, 48.6, 67.1, 421.5]  # example data
new_array = pd.DataFrame(data)
scores = loop.LocalOutlierProbability(new_array).fit()
scores = scores.local_outlier_probabilities

Traceback:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-6-1cb16c12004f> in <module>()
      6 
      7 new_array=pd.DataFrame(l)
----> 8 scores = loop.LocalOutlierProbability(new_array).fit()
      9 scores = scores.local_outlier_probabilities
     10 np.where(scores > DETECTION_FACTOR)[0]

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in fit(self)
    226         store = self._norm_prob_local_outlier_factors(store)
    227         self.norm_prob_local_outlier_factor = np.max(store[:, 9])
--> 228         store = self._local_outlier_probabilities(store)
    229         self.local_outlier_probabilities = store[:, 10]
    230 

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in _local_outlier_probabilities(self, data_store)
    211         return np.hstack(
    212             (data_store,
--> 213              np.array([np.apply_along_axis(self._local_outlier_probability, 0, data_store[:, 7], data_store[:, 9])]).T))
    214 
    215     def fit(self):

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
    114     except StopIteration:
    115         raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 116     res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
    117 
    118     # build a buffer for storing evaluations of func1d.

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in _local_outlier_probability(plof_val, nplof_val)
    111     def _local_outlier_probability(plof_val, nplof_val):
    112         erf_vec = np.vectorize(erf)
--> 113         return np.maximum(0, erf_vec(plof_val / (nplof_val * np.sqrt(2.))))
    114 
    115     def _n_observations(self):

ZeroDivisionError: float division by zero

Thanks for identifying the above issue. I was able to replicate and fix it; see this commit. The problem stemmed from the fact that PyNomaly allowed n_neighbors to exceed the number of observations, which is not valid. A check has been added to ensure that n_neighbors is less than the number of observations. In the above, try setting the number of neighbors to a value less than the total number of observations (here, fewer than 6). Thanks again for identifying the issue.

Great I will give it a try. I have a question though, why do sys.exit() instead of throwing an Exception?

It’s merely a preference. Since the user should not use a number of neighbors greater than the number of observations, I print a warning and exit the program. I could just as easily throw an exception and let the user decide how to proceed, but I prefer to exit, since a neighborhood larger than the number of observations isn’t valid under the mathematical definition of LoOP.

Would you prefer an exception to a sys.exit()? It could provide a nicer experience. The package is still in development, so I’m open to discussing changes for future releases that would improve usability.

Maybe it doesn't matter. I think an Exception will achieve the same and the user can still "try/except" either one. I am just more used to the idea of an Exception.
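One subtlety with the try/except point above: sys.exit() raises SystemExit, which inherits from BaseException rather than Exception, so a plain `except Exception` will not catch it. A sketch of the exception-based alternative being discussed (a hypothetical helper, not the actual PyNomaly code):

```python
def validate_n_neighbors(n_neighbors: int, n_observations: int) -> None:
    """Hypothetical validation helper: raise instead of calling sys.exit()."""
    if n_neighbors >= n_observations:
        raise ValueError(
            "n_neighbors (%d) must be less than the number of "
            "observations (%d)" % (n_neighbors, n_observations)
        )

# Callers can then opt into ordinary exception handling:
try:
    validate_n_neighbors(10, 6)
except ValueError as err:
    message = str(err)
```

This keeps the library from terminating the caller's process while still refusing mathematically invalid input.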

Going back to the mathematical issue at hand. Is there an optimal n_neighbors for a particular dataset size?

The ideal number of neighbors really depends on the distribution of your data and the context in which you want to apply the approach. The LoOP paper references the earlier paper that introduced LOF (local outlier factor). In that earlier paper, the authors suggest running the approach with different values for the number of neighbors, the idea being that a true outlier or anomaly should be detected regardless of the neighborhood size. You could, for example, run LoOP with different values of n_neighbors on the same dataset and average the scores for each observation. Think it through, though, and see whether this approach applies to your particular problem.

Hope this helps.