jmschrei / apricot

apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html

error when providing initial_indices for sparse array data

chschroeder opened this issue

Hi,

at first glance this library looks really nice (with regard to API, code, and docs), and I really like it. Kudos for that!

The first steps were easy to follow using the examples.
However, when I switched from dense to sparse arrays, I ran into trouble:

Is FeatureBasedSelection in combination with the initial_subset argument intended to work on sparse arrays?
According to the documentation, scipy's csr_matrix should be supported, right?

(1) without initial_subset

selector = FeatureBasedSelection(n, concave_func='sqrt')
selector.fit(x)

(2) with initial_subset

selector = FeatureBasedSelection(n, concave_func='sqrt', initial_subset=initial_subset)
selector.fit(x)

Whenever x is an ndarray (dense), both variants work fine.
However, for a csr_matrix (sparse), only the former works; for the latter I get the following error:

```
  File "<my_workspace>/my_script.py", line 86, in my_func
    selector.fit(x)
  File "<site-packges>/apricot/functions/featureBased.py", line 265, in fit
    return super(FeatureBasedSelection, self).fit(X, y=y,
  File "<site-packges>/apricot/functions/base.py", line 251, in fit
    optimizer.select(X, self.n_samples, sample_cost=sample_cost)
  File "<site-packges>/apricot/optimizers.py", line 491, in select
    optimizer1.select(X, self.n_first_selections, sample_cost=sample_cost)
  File "<site-packges>/apricot/optimizers.py", line 234, in select
    gains = self.function._calculate_gains(X) / sample_cost[self.function.idxs]
  File "<site-packges>/apricot/functions/featureBased.py", line 321, in _calculate_gains
    concave_func(X.data, X.indices, X.indptr, gains,
  File "<site-packges>/numba/core/dispatcher.py", line 608, in _explain_matching_error
    raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(float64, 1d, C), array(int32, 1d, C), array(int32, 1d, C), array(float64, 1d, C), array(float64, 2d, C), array(float64, 2d, C), array(int64, 1d, C)
```
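
Editorial note: the signature in the TypeError lists two `array(float64, 2d, C)` arguments. One possible reading, which is a guess and is not confirmed anywhere in this thread, is that some per-feature totals come out 2-D when they are computed from a sparse matrix. The sketch below only demonstrates the underlying SciPy behaviour with toy data: summing selected rows of a csr_matrix yields a 2-D `np.matrix`, while the same operation on an ndarray yields a 1-D array. The variable names are illustrative and nothing here is taken from apricot's code.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy data (illustrative only): 3 samples, 3 features.
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0],
                  [4.0, 0.0, 0.0]])
sparse = csr_matrix(dense)

# Rows playing the role of an initial subset.
rows = np.array([0, 2])

# Column sums over the selected rows.
dense_sum = dense[rows].sum(axis=0)    # 1-D ndarray, shape (3,)
sparse_sum = sparse[rows].sum(axis=0)  # 2-D np.matrix, shape (1, 3)

print(type(dense_sum).__name__, dense_sum.shape)
print(type(sparse_sum).__name__, sparse_sum.shape)

# Flattening the sparse result gives the same 1-D layout as the dense case.
flat_sum = np.asarray(sparse_sum).ravel()
print(type(flat_sum).__name__, flat_sum.shape)
```

If a compiled kernel was specialized for 1-D inputs, a 2-D argument like `sparse_sum` would not match it, which would be consistent with the error above. Whether that is what actually happens inside apricot is an assumption.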

Howdy

Thanks for reporting this. It does look like a bug on my end. The selectors are supposed to work with both dense and sparse arrays, even when using an initial subset. I'll try to fix it in the next week or two. Sorry about that! If you need a fix sooner than that, you should go into the FeatureBasedSelection code and just hard-code the gain _select_next function that you want to use.
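
Editorial note: since the report above says both variants work for a dense ndarray, another stopgap, when the data fits in memory, might be to densify the matrix before calling fit. The helper below is a minimal sketch of that idea. `densify_if_small` and its `max_bytes` threshold are hypothetical names introduced here and are not part of apricot.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def densify_if_small(x, max_bytes=1_000_000_000):
    """Return x as a dense ndarray, refusing if the dense copy would be too big.

    Hypothetical helper, not part of apricot. max_bytes is an arbitrary
    safety threshold chosen for illustration.
    """
    if not issparse(x):
        return np.asarray(x)
    n_rows, n_cols = x.shape
    # Size of the dense copy: one value per cell, zeros included.
    dense_bytes = n_rows * n_cols * x.dtype.itemsize
    if dense_bytes > max_bytes:
        raise MemoryError(
            f"dense copy would need {dense_bytes} bytes (limit {max_bytes})"
        )
    return x.toarray()


# Toy input standing in for the sparse x from the report.
x_sparse = csr_matrix(np.eye(3))
x_dense = densify_if_small(x_sparse)
print(type(x_dense).__name__, x_dense.shape)

# With apricot installed, the call from the report would then become:
# selector = FeatureBasedSelection(n, concave_func='sqrt', initial_subset=initial_subset)
# selector.fit(densify_if_small(x))
```

This gives up the memory savings of the sparse format, so it is only a workaround for data sets small enough to hold densely.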

Thanks for the quick response! There is no hurry at all. I am happy to hear that there will be a fix.