Initial Subset with Streaming SO

Question

devinity1337 opened this issue 3 years ago · comments

from apricot import FacilityLocationSelection
import numpy

X = numpy.exp(numpy.random.randn(1000, 50))
print(X)

X_corr = numpy.corrcoef(X) ** 2

model = FacilityLocationSelection(10, 'corr', initial_subset=[1, 5, 6, 8, 10])
model.partial_fit(X)

print(model.ranking, X_corr[model.ranking].max(axis=0).sum())

Gives an error:

ValueError: operands could not be broadcast together with shapes (995,) (1000,)

The size mismatch equals the size of the initial subset, and the error occurs only when I pass an initial subset, so something seems to go wrong when streaming is combined with an initial subset.

Jacob Schreiber · Answer 1 · Sat Oct 09 2021 02:28:50 GMT+0800 (China Standard Time)

Oof, thanks for the report! Okay, I can look into it. I'm not sure when I'll get to it, so you might be better off looking into an alternate approach in the meantime, but I'll let you know when I fix it.

devinity1337 · Answer 2 · Wed Oct 13 2021 19:57:13 GMT+0800 (China Standard Time)

Thanks. Any insight into the best number of nearest neighbors to use? I've started with 1000.

Also, where do I pass the "pre-computed" distances for the sparse matrix encoding? I don't see a parameter for them.

Jacob Schreiber · Answer 3 · Thu Oct 14 2021 01:50:00 GMT+0800 (China Standard Time)

If you use precomputed distances you can set `metric="precomputed"` and then pass the sparse matrix into `fit` or `fit_transform` as normal. I think that there's been some work suggesting that using log2(n_examples) neighbors is sufficient to achieve some theoretical properties, but I can't remember what those properties are.
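To make the neighbor-count suggestion concrete, here is a minimal sketch that keeps only the top log2(n_examples) similarities per row and stores them in a sparse matrix. It uses only numpy and scipy. The helper name `knn_similarity` is hypothetical, and the squared-correlation similarity is an assumption borrowed from the original report, not something apricot requires.

```python
import numpy
from scipy.sparse import csr_matrix


def knn_similarity(X, k=None):
    """Return a sparse (n, n) matrix holding each row's k largest similarities.

    Hypothetical helper for illustration; not part of apricot.
    """
    n = X.shape[0]
    if k is None:
        # Heuristic mentioned in the reply: roughly log2(n_examples) neighbors.
        k = int(numpy.ceil(numpy.log2(n)))
    k = max(1, min(k, n))

    # Assumed similarity: squared correlation between rows, in [0, 1].
    sim = numpy.corrcoef(X) ** 2

    # Column indices of the k largest entries in each row (in no particular order).
    top = numpy.argpartition(sim, n - k, axis=1)[:, n - k:]
    rows = numpy.repeat(numpy.arange(n), k)
    cols = top.ravel()
    return csr_matrix((sim[rows, cols], (rows, cols)), shape=(n, n))


rng = numpy.random.default_rng(0)
X = numpy.exp(rng.standard_normal((1000, 50)))
X_sparse = knn_similarity(X)  # 10 stored entries per row for n = 1000
```

Each row's own diagonal entry counts as one of its k neighbors here, and the result is not guaranteed to be symmetric. Following the reply above, a matrix like `X_sparse` would then be passed to `fit` with `metric="precomputed"`.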

devinity1337 · Answer 4 · Thu Oct 14 2021 13:57:20 GMT+0800 (China Standard Time)

from apricot import FacilityLocationSelection
import numpy
from scipy.sparse import csr_matrix

X = numpy.random.uniform(0, 1, size=(6000, 6000))
X = (X + X.T) / 2.
X[X < 0.9] = 0.0
X_sparse = csr_matrix(X)

#FacilityLocationSelection(500, 'precomputed', verbose=True).fit(X)
FacilityLocationSelection(500, 'precomputed', verbose=True).fit(X_sparse)

The code seems to work, but what distance metric is it actually using?

Jacob Schreiber · Answer 5 · Thu Oct 14 2021 14:03:19 GMT+0800 (China Standard Time)

It doesn't calculate anything itself. It assumes that you're passing in a similarity matrix, where 1 means most similar, as opposed to a distance matrix, where 0 means closest. A problem is that most standard similarity functions don't produce sparse similarity matrices, even if many of the elements are small. If you manually produce a similarity matrix that is sparse, though, it knows how to use that sparsity to speed up the algorithm.
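If what you have is a distance matrix, it needs converting before it can be used this way. Below is a rough sketch of one possible conversion, followed by the same thresholding used in the snippet above to make the result sparse. The Gaussian kernel, the median-based scale, and the helper name `distances_to_sparse_similarity` are all assumptions for illustration; apricot does not prescribe any of them.

```python
import numpy
from scipy.sparse import csr_matrix
from scipy.spatial.distance import pdist, squareform


def distances_to_sparse_similarity(D, threshold=0.9):
    """Map distances (0 = identical) to similarities (1 = identical), then sparsify.

    Hypothetical helper for illustration; not part of apricot.
    """
    # Assumed conversion: Gaussian kernel scaled by the median nonzero distance.
    scale = numpy.median(D[D > 0])
    S = numpy.exp(-(D / scale) ** 2)

    # Zero out weak similarities so that only strong pairs are stored.
    S[S < threshold] = 0.0
    return csr_matrix(S)


rng = numpy.random.default_rng(0)
points = rng.uniform(0, 1, size=(200, 5))
D = squareform(pdist(points))  # dense Euclidean distance matrix
S_sparse = distances_to_sparse_similarity(D)
```

The diagonal stays at 1 because every point is at distance 0 from itself. How aggressively to threshold is a judgment call that trades accuracy for speed.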
