jmschrei / apricot

apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html

partial_fit behavior for Facility Location

610v4nn1 opened this issue · comments

I am using the library with great results, but I am struggling to understand if the behavior reported below is correct.

I am loading batches of data from disk and then calling partial_fit on every batch. Assume each batch contains 1000 data points and I load 10 batches; the data points are then indexed from 0 to 9999.
After each partial_fit call, I store the indices returned in the ranking along with the associated data points (e.g., [1, 2, ..., 100]).
I would expect the indices returned by partial_fit at batch B to come either from the stored list or from batch B itself, but that does not seem to be the case when I use facility location (empirically it does hold for feature-based selection).

For example, if my stored indices are pi = [1, 2, 5, 90] and the current batch contains the points with indices bi = [100, 101, ..., 110], I would expect every index in ci = selector.ranking to be in pi + bi. Is this a wrong assumption?
If my assumption is wrong, how do you retrieve the points associated with the selected indices without a second pass over the data?
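
For reference, a minimal sketch of that workflow (the batch files and loader are hypothetical; FacilityLocationSelection, partial_fit, and ranking are the apricot names used above, assuming the constructor's first argument is the number of examples to select):

```python
import numpy
from apricot import FacilityLocationSelection

selector = FacilityLocationSelection(100)   # select 100 examples

stored_indices = set()
for b in range(10):
    # hypothetical loader: each file holds one batch of 1000 points
    X_batch = numpy.load("batch_{}.npy".format(b))
    selector.partial_fit(X_batch)

    # indices in the ranking are relative to the start of the stream,
    # i.e., they fall in [0, (b + 1) * 1000)
    stored_indices = set(selector.ranking)
```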

The stored indices should be relative to the beginning of the stream: if you selected the 1000th example, regardless of how many batches it took to get there, you should have index 1000 stored (or 999, since indexing is 0-based). See this code, which keeps a running total of the number of examples you've seen: https://github.com/jmschrei/apricot/blob/master/apricot/optimizers.py#L1101
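
As a toy illustration (not apricot's internals), stream-relative indexing just means shifting each batch-local pick by a running count of the examples seen so far:

```python
batches = [list(range(1000)) for _ in range(10)]   # ten hypothetical batches

offset, stream_indices = 0, []
for batch in batches:
    local_picks = [3, 17]                  # hypothetical within-batch selections
    stream_indices.extend(offset + i for i in local_picks)
    offset += len(batch)                   # running total, as in the linked code

print(stream_indices[:4])                  # [3, 17, 1003, 1017]
```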

If you just want the examples, and not necessarily the indices, you can use the subset attribute (see https://github.com/jmschrei/apricot/blob/master/apricot/optimizers.py#L1122). The way the sieve algorithm (behind the partial_fit method) works is that it makes several guesses as to how good the best example will be and, by thresholding marginal gains against each guess, turns the selection into an efficient one-pass algorithm. The one-pass algorithm is run in parallel for each guess, and at the end of each batch the guess closest to the truth has its subset returned.
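
A simplified sketch of that idea (sieve streaming; this is not apricot's actual code, which lives in apricot/optimizers.py, and f stands for any monotone submodular set function):

```python
def sieve_stream(stream, f, k, guesses):
    # one candidate subset per guess v of the optimal value
    subsets = {v: [] for v in guesses}
    for x in stream:
        for v, S in subsets.items():
            if len(S) < k:
                gain = f(S + [x]) - f(S)
                # threshold rule: keep x only if its marginal gain is
                # large enough for this guess of the optimum
                if gain >= (v / 2.0 - f(S)) / (k - len(S)):
                    S.append(x)
    # report the subset whose guess worked out best so far
    return max(subsets.values(), key=f)
```

In the full algorithm the guesses form a geometric grid over estimates of the optimal value, which is what lets a single pass stay within a constant factor of the best achievable subset.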

subset is very useful, but it is not what I need, since I have labels associated with my data points; I think I need to keep the indices.

Anyway, what I am observing right now is that when calling partial_fit on a batch of points in the range (i, j), the ranking may contain indices that are smaller than i and that were not present in the previously stored list.

I can't be positive this is what's happening without seeing the example, but I'm almost certain it's related to the procedure I described above. Like I said, the streaming algorithm maintains several parallel subsets, each selected under a different quality estimate. After each partial_fit call, you're given the subset whose estimate is closest given the entire stream so far. From one partial_fit call to the next, a different subset can have the best estimate, in which case the entire returned subset changes. Unfortunately, it's difficult to get good estimates over the entire stream when your only choices are to reject the next example or to use it to replace an already selected example within a single subset.
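
Continuing the hypothetical sketch from earlier in the thread, this is one way to make that behavior visible: the new ranking need not lie inside the union of the previously returned ranking and the current batch's index range.

```python
prev, offset = set(), 0
for b in range(10):
    X_batch = numpy.load("batch_{}.npy".format(b))   # hypothetical loader, as above
    selector.partial_fit(X_batch)
    current = set(selector.ranking)
    batch_range = set(range(offset, offset + len(X_batch)))
    if not current <= prev | batch_range:
        print("batch", b, ": ranking switched to a different sieve subset")
    prev, offset = current, offset + len(X_batch)
```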

I don't think this is a particularly complicated matter; it would be enough to provide a method that retrieves the indices of the points in the union over all subsets.

The sieve_selections_ attribute on the selector should contain all the indices. https://github.com/jmschrei/apricot/blob/master/apricot/optimizers.py#L1113
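
A hedged sketch of how that could be used, assuming (from the linked code) that sieve_selections_ is an integer array of stream-relative indices with one row per guess, and that you kept a mapping from stream index to point and label while loading (points_by_index and labels_by_index are hypothetical names for that store):

```python
import numpy

all_indices = numpy.unique(selector.sieve_selections_)
all_indices = all_indices[all_indices >= 0]     # drop unfilled slots, if any

# map back to the stored data without a second pass over the stream
selected_points = [points_by_index[i] for i in all_indices]
selected_labels = [labels_by_index[i] for i in all_indices]
```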

Terrific, thanks!

It works perfectly, thanks a lot for your help and all the work you put into this library.