jmschrei / apricot

apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html

Memory Error for Large Dataset

devinity1337 opened this issue

I'm trying FacilityLocationSelection on a dataset of 7,000,000 samples, using Hamming distance as the metric, and I get a memory error:

numpy.core._exceptions.MemoryError: Unable to allocate 196. TiB for an array with shape (26998899737040,) and data type float64

On my previous dataset of 2,500,000 samples the code worked fine. I assume the issue is computing the distances between every pair of samples. I'm wondering if there are any options for reducing the memory cost. I've also tried faster optimizers and different functions, but I run into the same error.
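For scale, the requested allocation corresponds to one float64 per pair of samples, i.e. n(n-1)/2 entries. A quick back-of-the-envelope check (the exact sample count below is inferred from the array shape in the traceback, not stated above):

```python
# Sanity check on the reported allocation: a condensed pairwise-distance
# array holds n * (n - 1) / 2 float64 values.
n = 7_348_320                      # inferred from the array shape; roughly the 7M samples
n_pairs = n * (n - 1) // 2         # 26998899737040, the shape in the traceback
print(n_pairs * 8 / 2**40)         # 8 bytes per float64 -> ~196 TiB
```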

Howdy. The clear solution is to get a computer with 196 TiB of memory. But if that's not possible, there are three alternatives.

The first is to precompute a sparse similarity matrix and feed that into apricot. See: https://github.com/jmschrei/apricot/blob/master/tutorials/3.%20Using%20Sparse%20Inputs.ipynb
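For concreteness, here is a minimal sketch of that route, assuming apricot accepts a scipy CSR matrix of similarities via metric='precomputed' as in the linked tutorial; the neighborhood size, subset size, and random stand-in data are illustrative, not from this thread:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from apricot import FacilityLocationSelection

X = np.random.randint(0, 2, size=(10_000, 50))    # stand-in for the real binary data

# Build a sparse k-nearest-neighbor graph of Hamming distances, then convert the
# stored distances into similarities (sklearn's hamming distance lies in [0, 1]).
nn = NearestNeighbors(n_neighbors=100, metric='hamming').fit(X)
S = nn.kneighbors_graph(X, mode='distance')       # CSR with ~n * k stored entries
S.data = 1.0 - S.data                             # distance -> similarity
S = S.maximum(S.T).tocsr()                        # symmetrize

# Select 1,000 exemplars from the precomputed sparse similarity matrix.
selector = FacilityLocationSelection(1000, metric='precomputed')
selector.fit(S)
print(selector.ranking[:10])                      # indices of the first selections
```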

The second is to use only a limited number of nearest neighbors. Constructing the KNN graph takes some extra time but should scale to massive inputs. You can set this by passing n_neighbors when constructing the selector. See: https://apricot-select.readthedocs.io/en/latest/functions/facilityLocation.html
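Something along these lines, assuming the n_neighbors argument documented on the page above; the concrete sizes and stand-in data are placeholders:

```python
import numpy as np
from apricot import FacilityLocationSelection

X = np.random.randint(0, 2, size=(10_000, 50))    # stand-in for the real binary data

# With n_neighbors set, only each point's 100 nearest neighbors contribute to
# the similarity structure, so memory grows as n * n_neighbors rather than n**2.
selector = FacilityLocationSelection(1000, metric='hamming', n_neighbors=100)
X_subset = selector.fit_transform(X)
print(X_subset.shape)                             # (1000, 50)
```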

The third is to use streaming submodular optimization: https://apricot-select.readthedocs.io/en/latest/features/streaming.html. The gist is that you make a single pass over the entire data set, greedily keeping elements whose marginal gain is above a certain threshold. The resulting subset will not be as good, but the approach should scale to any number of elements.
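A rough sketch of how that could look, assuming the streaming interface is exposed through the selectors' partial_fit method and that it accepts successive chunks; check the linked page for the authoritative invocation:

```python
import numpy as np
from apricot import FacilityLocationSelection

X = np.random.randint(0, 2, size=(100_000, 50))   # stand-in for the real binary data

# One pass over the data, greedily keeping examples whose marginal gain clears
# a threshold (sieve streaming). The full pairwise matrix is never built.
selector = FacilityLocationSelection(1000, metric='hamming')

for chunk in np.array_split(X, 100):              # feed the data in small batches
    selector.partial_fit(chunk)

print(selector.ranking[:10])                      # indices of the kept examples
```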

Thank you for the comprehensive reply!