jmschrei / apricot

apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html

Initial Subset with Streaming SO

devinity1337 opened this issue

from apricot import FacilityLocationSelection
import numpy

X = numpy.exp(numpy.random.randn(1000, 50))
print(X)

X_corr = numpy.corrcoef(X) ** 2

model = FacilityLocationSelection(10, 'corr', initial_subset=[1, 5, 6, 8, 10])
model.partial_fit(X)

print(model.ranking, X_corr[model.ranking].max(axis=0).sum())

Gives an error:

ValueError: operands could not be broadcast together with shapes (995,) (1000,)

The size mismatch (5) equals the size of the initial subset, and the error only occurs when I pass an initial subset, so something seems to go wrong when streaming (partial_fit) is combined with an initial subset.
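For reference, the shapes in the message can be reproduced with NumPy alone. This is only a sketch of the kind of mismatch the error describes: it assumes one per-row array covers all 1000 rows while another covers the 995 rows left after the 5 initial indices are removed. The array names are hypothetical and it does not show where apricot actually fails.

```python
import numpy

n_rows = 1000
initial_subset = [1, 5, 6, 8, 10]

# Hypothetical per-row array covering every row of X.
all_rows = numpy.zeros(n_rows)

# Hypothetical per-row array covering only rows outside the initial subset.
mask = numpy.ones(n_rows, dtype=bool)
mask[initial_subset] = False
remaining_rows = numpy.zeros(n_rows)[mask]

try:
    remaining_rows + all_rows  # lengths 995 and 1000 cannot be broadcast
except ValueError as err:
    message = str(err)
    print(message)
```

The difference between the two lengths is exactly `len(initial_subset)`, which matches the observation above.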

Oof, thanks for the report! Okay, I can look into it. I'm not sure when I'll get to it, so you might be better off looking into an alternative approach in the meantime, but I'll let you know when I fix it.

Thanks. Any insight into the best number of nearest neighbors to use? I've started with 1000.

Also, where do I put the "pre-computed" distances for the sparse matrix encoding? I don't see a parameter for them.

from apricot import FacilityLocationSelection
import numpy
from scipy.sparse import csr_matrix

X = numpy.random.uniform(0, 1, size=(6000, 6000))
X = (X + X.T) / 2.
X[X < 0.9] = 0.0
X_sparse = csr_matrix(X)

# Dense version: FacilityLocationSelection(500, 'precomputed', verbose=True).fit(X)
FacilityLocationSelection(500, 'precomputed', verbose=True).fit(X_sparse)

The code seems to work, but what distance metric is it actually using?
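On the metric question, here is a sketch under one assumption: that `'precomputed'` means the matrix passed to `fit` is used directly as pairwise similarities, so no distance metric is computed from it and zero entries (including those dropped by the sparse encoding) count as zero similarity. The NumPy code below is not apricot's implementation. `facility_location_value` and `greedy_facility_location` are hypothetical helpers that show the objective such a matrix would feed, which is the same quantity as the `X_corr[model.ranking].max(axis=0).sum()` check earlier in this thread.

```python
import numpy

def facility_location_value(S, selected):
    """Facility location objective: sum over rows of the best similarity
    to any selected column of the precomputed matrix S."""
    return S[:, selected].max(axis=1).sum()

def greedy_facility_location(S, k):
    """Naive greedy selection of k columns of a precomputed similarity matrix."""
    n = S.shape[0]
    best = numpy.zeros(n)               # best similarity of each row so far
    chosen = numpy.zeros(n, dtype=bool)
    selected = []
    for _ in range(k):
        # Marginal gain of column j: sum_i max(S[i, j] - best[i], 0)
        gains = numpy.maximum(S - best[:, None], 0.0).sum(axis=0)
        gains[chosen] = -numpy.inf      # never pick the same column twice
        j = int(numpy.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best = numpy.maximum(best, S[:, j])
    return selected

# Small stand-in for the 6000 x 6000 matrix above (threshold is illustrative).
rng = numpy.random.default_rng(0)
S = rng.uniform(0, 1, size=(200, 200))
S = (S + S.T) / 2.0
S[S < 0.6] = 0.0

selected = greedy_facility_location(S, 5)
print(selected, facility_location_value(S, selected))
```

If that assumption holds, whatever produced the matrix (here, thresholded uniform noise) is the "metric", and the values should be similarities (larger means closer) rather than distances.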