jmschrei / apricot

apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html

Raising ValueError erroneously

garethclews opened this issue

I have installed apricot 0.2.3 from PyPI, and following the example in the README throws a ValueError.

import numpy
from apricot import FacilityLocationSelection

X = numpy.random.normal(100, 1, size=(1000, 25))
X_subset = FacilityLocationSelection(100).fit_transform(X)

results in:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-15da598d409a> in <module>()
...snip...
~/.pyenv/versions/3.6.4/lib/python3.6/site-packages/apricot/base.py in fit(self, X, y)
    108                         raise ValueError("X must have exactly two dimensions.")
    109                 if numpy.min(X) < 0.0 and numpy.max(X) > 0.:
--> 110 			raise ValueError("X cannot contain negative values or must be entirely "\
    111 				"negative values.")
    112

ValueError: X cannot contain negative values or must be entirely negative values.

but numpy.min(X) < 0.0 and numpy.max(X) > 0. returns False for the X generated.
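A minimal way to check that condition on data drawn from the same distribution is sketched below. The seed and the variable names `has_negative`, `has_positive`, and `mixed_sign` are illustrative assumptions and not part of apricot.

```python
import numpy

# Arbitrary seed, used only so the check is reproducible.
rng = numpy.random.RandomState(0)
X = rng.normal(100, 1, size=(1000, 25))

# The same test that base.py applies before raising the ValueError.
has_negative = numpy.min(X) < 0.0
has_positive = numpy.max(X) > 0.0
mixed_sign = bool(has_negative and has_positive)

print(X.min(), X.max(), mixed_sign)
```

With a mean of 100 and a standard deviation of 1, every entry of X should be well above zero, so the check should not trigger on the raw data.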

System information:

  • Mac OS 10.13.6
  • Python 3.6.4
  • numpy 1.14.5
  • numba 0.39.0

Hi @karetsu

Thanks for reporting the bug. It cropped up in a small optimization I tried to do. When you're using facility location functions, it's not the data set that needs to be non-negative (or entirely negative), it's the pairwise similarity matrix (whereas with feature-based methods the data themselves need to be strictly non-negative). The default similarity for facility location is Euclidean distance, where the distance between a point and itself is 0 in theory but can come out as a small positive value at floating-point precision in practice, which is annoying. I resolved the issue by subtracting out the maximal value (typically around machine precision) along the diagonal when using Euclidean distance as the similarity. Try getting the latest code (0.2.4 on PyPI) and let me know if that fixes the issue.

Please re-open if you encounter this issue again.
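
For readers who want to see the effect described above, here is a rough sketch in plain NumPy. It is not apricot's actual code. It assumes the similarity is the negative squared Euclidean distance computed through the dot-product expansion, and it assumes the maximal diagonal value is subtracted from the whole matrix. The helper names `neg_sq_euclidean_similarity` and `shift_by_diagonal_max` are hypothetical.

```python
import numpy

def neg_sq_euclidean_similarity(X):
    """Negative squared Euclidean distance via ||x||^2 + ||y||^2 - 2 x.y.

    Assumption: this expansion stands in for whatever apricot uses
    internally. It is fast but leaves rounding error on the diagonal.
    """
    sq_norms = (X ** 2).sum(axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return -dist

def shift_by_diagonal_max(S):
    """Subtract the largest diagonal entry so no diagonal entry is positive."""
    return S - numpy.diag(S).max()

rng = numpy.random.RandomState(0)  # arbitrary seed
X = rng.normal(100, 1, size=(1000, 25))

S = neg_sq_euclidean_similarity(X)
diag = numpy.diag(S)
# In exact arithmetic the diagonal is all zeros; here it may carry
# tiny values of either sign.
print("diagonal range before:", diag.min(), diag.max())

S_fixed = shift_by_diagonal_max(S)
print("diagonal max after:", numpy.diag(S_fixed).max())
print("matrix max after:", S_fixed.max())
```

The point of the shift is that a matrix that is negative everywhere except for a few diagonal entries at rounding-error scale would otherwise look like mixed-sign input and trip the ValueError.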