ZeroDivisionError when using DataFrame
manuelgodoy opened this issue · comments
I had been using version 0.1.5 without problems in the following script, but after upgrading I now see the error below:
import pandas as pd
from PyNomaly import loop

data = [43.3, 62.9, 55.2, 48.6, 67.1, 421.5]  # example data
new_array = pd.DataFrame(data)
scores = loop.LocalOutlierProbability(new_array).fit()
scores = scores.local_outlier_probabilities
Traceback:
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-6-1cb16c12004f> in <module>()
6
7 new_array=pd.DataFrame(l)
----> 8 scores = loop.LocalOutlierProbability(new_array).fit()
9 scores = scores.local_outlier_probabilities
10 np.where(scores > DETECTION_FACTOR)[0]
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in fit(self)
226 store = self._norm_prob_local_outlier_factors(store)
227 self.norm_prob_local_outlier_factor = np.max(store[:, 9])
--> 228 store = self._local_outlier_probabilities(store)
229 self.local_outlier_probabilities = store[:, 10]
230
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in _local_outlier_probabilities(self, data_store)
211 return np.hstack(
212 (data_store,
--> 213 np.array([np.apply_along_axis(self._local_outlier_probability, 0, data_store[:, 7], data_store[:, 9])]).T))
214
215 def fit(self):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
114 except StopIteration:
115 raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 116 res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
117
118 # build a buffer for storing evaluations of func1d.
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyNomaly/loop.py in _local_outlier_probability(plof_val, nplof_val)
111 def _local_outlier_probability(plof_val, nplof_val):
112 erf_vec = np.vectorize(erf)
--> 113 return np.maximum(0, erf_vec(plof_val / (nplof_val * np.sqrt(2.))))
114
115 def _n_observations(self):
ZeroDivisionError: float division by zero
Thanks for identifying this issue. I was able to replicate and rectify it; see this commit. The problem stemmed from the fact that PyNomaly allowed n_neighbors to exceed the number of observations, which is not valid. A check has been added to ensure that n_neighbors is less than the number of observations. In your example above, try setting the number of neighbors to a value less than the total number of observations.
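For reference, the expression that fails is the normalization inside _local_outlier_probability (loop.py line 113 in the traceback), which divides the pLOF value by nplof_val * sqrt(2). When the neighborhood is degenerate, nplof_val ends up as 0.0 and plain-float division raises. A minimal sketch of just that expression (not the actual PyNomaly code):

```python
from math import erf, sqrt

def local_outlier_probability(plof_val, nplof_val):
    # Mirrors the shape of PyNomaly's _local_outlier_probability:
    # erf(pLOF / (nPLOF * sqrt(2))), clipped at zero.
    return max(0.0, erf(plof_val / (nplof_val * sqrt(2.0))))

print(local_outlier_probability(1.5, 1.0))  # a valid probability in [0, 1)
try:
    local_outlier_probability(1.5, 0.0)     # degenerate nPLOF
except ZeroDivisionError as e:
    print(e)                                # float division by zero
```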
Great, I will give it a try. I have a question, though: why do a sys.exit() instead of throwing an Exception?
It’s merely a preference. Since the user should not be using a number of neighbors greater than the number of observations, I provide a warning and exit the program. I could just as easily throw an exception and let the user decide how to proceed, but I prefer to exit since using more neighbors than observations isn’t valid under the mathematical definition of LoOP.
Would you prefer an exception to a sys.exit()? Doing so could provide a nicer experience. The package is still in development, so I'm open to discussing changes for future releases that would improve usability.
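To make the trade-off concrete, here is a sketch of the two behaviors side by side. The helper name check_n_neighbors is hypothetical, not PyNomaly's actual API:

```python
import sys
import warnings

def check_n_neighbors(n_neighbors, n_observations):
    # Hypothetical validation helper, not PyNomaly's actual code.
    if n_neighbors >= n_observations:
        # Current behavior (roughly): warn and terminate.
        # warnings.warn("n_neighbors must be less than the number of observations")
        # sys.exit()
        # Exception-based alternative: the caller can catch and recover.
        raise ValueError(
            "n_neighbors (%d) must be less than the number of observations (%d)"
            % (n_neighbors, n_observations)
        )

try:
    check_n_neighbors(10, 6)
except ValueError as e:
    print("caught:", e)  # caller decides how to proceed
```

Note that sys.exit() actually raises SystemExit, which a caller can also catch, but a ValueError communicates the nature of the problem more directly.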
Maybe it doesn't matter. I think an Exception would achieve the same thing, and the user can still "try/except" either one. I am just more used to the idea of an Exception.
Going back to the mathematical issue at hand: is there an optimal n_neighbors for a particular dataset size?
The ideal number of neighbors really depends on the distribution of your data and the context in which you are wanting to apply the approach. The LoOP paper references an earlier paper that introduces LOF (local outlier factor). In that earlier paper, the authors suggest running the approach using different values of the number of neighbors, with the idea being that a true outlier or anomaly should be detected regardless of the neighborhood size. You could, for example, run LoOP with different values of n_neighbors on the same dataset and average the scores for each observation. Definitely think about it though and see if this approach would apply to your particular problem.
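As a rough sketch of that ensembling idea, using a toy k-nearest-neighbor distance score in place of a real LoOP run (score_with_k is a hypothetical stand-in for something like loop.LocalOutlierProbability(data, n_neighbors=k).fit().local_outlier_probabilities):

```python
import numpy as np

def score_with_k(data, k):
    # Toy stand-in for a LoOP run with n_neighbors=k:
    # mean distance to the k nearest neighbors, scaled to [0, 1].
    dists = np.abs(data[:, None] - data[None, :])
    dists.sort(axis=1)
    knn = dists[:, 1:k + 1].mean(axis=1)  # column 0 is the self-distance
    return knn / knn.max()

data = np.array([43.3, 62.9, 55.2, 48.6, 67.1, 421.5])
scores = [score_with_k(data, k) for k in (2, 3, 4)]  # each k < len(data)
avg = np.mean(scores, axis=0)
print(avg.argmax())  # -> 5: the gross outlier, 421.5, gets the top averaged score
```

An observation that scores high for every neighborhood size is a much stronger outlier candidate than one that only stands out at a single k.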
Hope this helps.