jmschrei / apricot

apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html

Inconsistent selector gains and duplicate elements in the selector ranking with the Facility Location function in v0.6.0

bhattg opened this issue

Hi,

I was working with two versions of apricot, 0.5.0 and 0.6.0. I am using the Facility Location function with the precomputed setting, passing my own matrix. I have two kinds of matrices: one that is highly sparse (and even contains zero diagonal elements) and another with a lot of 1s. Each matrix has dimension (1000, 1000), and I have to select 800 elements.
I have attached a folder containing the input matrices and the Jupyter notebooks, including the output runs.

Problem: Inconsistent selector gains and duplicate elements in the selector ranking with the Facility Location function in v0.6.0

Now when I pass these matrices to get the selector gains and the ranking of indices, there are the following problems:

  • The selector gains, which should be in descending order, are not in descending order in v0.6.0.
  • Moreover, the ranking array contains duplicate elements (which should never happen).

This finding is consistent for both matrices (the sparse one and the one with a lot of 1s).

However, this issue is not present in v0.5.0: gains are in descending order, and the ranking array does not contain any duplicate element.
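As a quick sanity check for both symptoms (a minimal sketch using a hypothetical random stand-in matrix; in practice I load the matrices from the attached files):

import numpy
from apricot import FacilityLocationSelection

# Hypothetical stand-in for the precomputed (1000, 1000) similarity matrix;
# any symmetric, non-negative matrix can be passed with metric='precomputed'.
rng = numpy.random.RandomState(0)
x = rng.uniform(0, 1, size=(1000, 1000))
x = (x + x.T) / 2

selector = FacilityLocationSelection(800, metric='precomputed', optimizer='lazy')
selector.fit(x)

# Gains should be non-increasing for a monotone submodular function.
print("gains in descending order:", bool(numpy.all(numpy.diff(selector.gains) <= 0)))

# The ranking should never contain the same index twice.
print("duplicates in ranking:", len(numpy.unique(selector.ranking)) < len(selector.ranking))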

Please find the attached Jupyter notebook and the corresponding file in the folder to reproduce the error.
Issue.zip

That's weird, looks like there is a bug in what's being reported in the ranking and gains attributes. For instance, if you do the following with the 0.6 notebook:

from apricot import FacilityLocationSelection

# x below is the precomputed similarity matrix loaded in the 0.6 notebook.
selector1 = FacilityLocationSelection(800, metric='precomputed', optimizer='naive')
selector2 = FacilityLocationSelection(800, metric='precomputed', optimizer='lazy')

selector1.fit(x.numpy())
selector2.fit(x.numpy())

selector1.ranking[selector1.gains != 0]
# array([  0, 671, 228, 165, 930, 563, 534, 610,  41, 675, 213, 500, 509,
#        255, 293, 608, 963, 204, 253,  15, 662, 284, 503, 909, 369, 649,
#        701, 241, 643, 877, 153, 463, 754, 388, 679, 724, 219, 573, 758,
#        259, 706,  34, 703,  32, 105, 694, 792, 380, 603, 214,  74, 147,
#        845, 159,  80, 892, 911, 844, 919, 622, 149, 258, 464, 203, 956,
#        595, 405, 972, 114, 789, 131, 809, 815, 328, 808, 211, 329, 605,
#        123, 496, 653,  82, 280, 311])

selector2.ranking[selector2.gains != 0]
# array([  0, 671, 228, 165, 930, 563, 534, 610,  41, 675, 213, 500, 509,
#        255, 293, 608, 963, 204, 253,  15, 662, 284, 503, 909, 369, 649,
#        701, 241, 643, 877, 153, 463, 754, 388, 679, 724, 219, 573, 758,
#        259, 706,  34, 703,  32, 105, 694, 792, 380, 603, 214,  74, 147,
#        845, 159,  80, 892, 911, 844, 919, 622, 149, 258, 464, 203, 956,
#        595, 405, 972, 114, 789, 131, 809, 815, 328, 808, 211, 329, 605,
#        123, 496, 653,  82, 280, 311])

you get the right answer. Both the gains and the rankings are right there. I'll have to look into why a bunch of 0 gains are added at the front for the lazy selection.

As for what you can do now, I'd suggest switching to the naive optimizer. The naive optimizer is robust against almost all issues that will affect the lazy optimizer, because the lazy optimizer has to deal with the hassle and edge cases of maintaining a priority queue.

I had a question about your data though. apricot uses kernel matrices, in the sense that 1 means perfect similarity and 0 means perfect dissimilarity. Are you sure that you're not feeding in a dissimilarity or distance matrix? There are a ton of 1's in here, meaning your data is super redundant.
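If it does turn out that you're passing a distance matrix, one simple way to convert it to similarities before passing it with metric='precomputed' is a monotone transform like the following (just a sketch; the RBF form and the bandwidth parameter are illustrative, not something apricot requires):

import numpy

def distances_to_similarities(D, bandwidth=1.0):
    # RBF-style transform: zero distance maps to 1, large distances toward 0.
    # The bandwidth is a free parameter you would need to tune for your data.
    return numpy.exp(-(numpy.asarray(D) ** 2) / (2.0 * bandwidth ** 2))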

Hi,
Thanks for the quick response. I have a question related to v0.5.0:

  • Is it OK to use v0.5.0 for the time being? I did not find this issue in that version.

Coming to your query,

  • Yes, the data actually IS highly redundant; only very few of the entries are actually important, and even though I am choosing 800 entries here, they too have high redundancy. This is kind of a corner case in the work I am dealing with; however, it should not matter from a submodularity point of view.

Gotcha, just checking that the redundancy is expected.

Yes, using 0.5 should be fine. I think that 0.6 just added new selectors and made the existing ones faster (but apparently introduced this issue, which the unit tests missed).

Alright, I will use v0.5. Are there any other differences between v0.5 and 0.6?

I looked into this a bit more and the issue looks somewhat benign. Essentially, there is a bug in apricot, but it's an edge case related to the structure of your data. If you make a heatmap of your matrix you get the following:

[heatmap of the precomputed similarity matrix]

You can see that, basically, there are rows and columns made up of exactly the same value. This means that, essentially, a single one of the redundant samples can explain every sample just as well as any of the other samples. Thus, submodular optimization first chooses one of the redundant samples (with no real preference for which one). Then, it chooses the non-redundant ones in reverse order of their similarity to the redundant examples, with each selection only providing a gain by explaining itself. Basically, after the first step, adding more items doesn't explain any of the OTHER examples.

This shouldn't normally be a big problem, because submodular optimization doesn't require that the gain of every item goes down each iteration-- it can also stay the same. However, in v0.6.0 I added a small optimization to the priority queue that terminates it when the gain it finds is 0 and the change is 0. Obviously this is somewhat of an issue for data structured like yours, where a bunch of items are perfectly redundant and, after the first selection, the remaining items are almost perfectly non-redundant.
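To make that concrete, here is a toy sketch (hypothetical data, not your matrix) with the same structure-- a block of perfectly redundant samples plus samples that only explain themselves-- that you can run against both optimizers; the exact numbers you see will depend on the apricot version:

import numpy
from apricot import FacilityLocationSelection

# Toy similarity matrix: the first 80 samples are perfectly redundant
# (identical rows of 1s); the remaining 20 are only similar to themselves.
n = 100
X = numpy.eye(n)
X[:80, :80] = 1.0

naive = FacilityLocationSelection(50, metric='precomputed', optimizer='naive')
lazy = FacilityLocationSelection(50, metric='precomputed', optimizer='lazy')
naive.fit(X)
lazy.fit(X)

# Only 1 redundant pick plus 20 self-explaining picks can provide any gain, so
# the last 29 of the 50 requested selections have a gain of exactly 0.
print(numpy.count_nonzero(naive.gains), numpy.count_nonzero(lazy.gains))
print(len(numpy.unique(naive.ranking)), len(numpy.unique(lazy.ranking)))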

You can check this hypothesis by adding a small amount of random Gaussian noise to the data so that the gain is never perfectly 0-- and that's exactly what you see:

import numpy
from apricot import FacilityLocationSelection

# aa['gram'][0] is the original precomputed matrix loaded in the notebook;
# the added noise just breaks the exact ties between the redundant samples.
x = aa['gram'][0] + numpy.abs(numpy.random.normal(0, 0.00001, size=(1000, 1000)))

selector1 = FacilityLocationSelection(800, metric='precomputed', optimizer='naive')
selector2 = FacilityLocationSelection(800, metric='precomputed', optimizer='lazy')

selector1.fit(x.numpy())
selector2.fit(x.numpy())

(selector1.ranking[:50] != selector2.ranking[:50]).sum()
# 0

tl;dr: the issue is minor, so I might not get around to it for a bit. You can use 0.5.0 because it doesn't include this "optimization."

Here's a Twitter thread about what I've added in v0.6: https://twitter.com/jmschreiber91/status/1311358116353642496?s=20

Thanks for the detailed response!