Memory Error for Large Dataset
devinity1337 opened this issue
I'm trying FacilityLocationSelection on a dataset of 7000000 samples, using Hamming distance as the metric. I get a memory error:
numpy.core._exceptions.MemoryError: Unable to allocate 196. TiB for an array with shape (26998899737040,) and data type float64
In my previous dataset I had 2500000 samples, and the code worked fine. I assume the issue is calculating the distances between every pair of samples. I'm wondering if there are any options for reducing the memory cost. I've also tried faster optimizers and different functions, but I run into the same error.
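The shape in the traceback is consistent with that assumption: it matches one float64 entry per unordered pair of samples. A quick check of the arithmetic, as a sketch only (the function names are illustrative, and the 8 bytes per value comes from the float64 in the error message):

```python
import math

def pair_count(n):
    """Number of unordered pairs among n samples: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def samples_from_pairs(m):
    """Invert pair_count: solve n * (n - 1) / 2 = m for n."""
    return (1 + math.isqrt(1 + 8 * m)) // 2

def tib(n_values, bytes_per_value=8):
    """Size in TiB of an array of n_values entries (float64 by default)."""
    return n_values * bytes_per_value / 2**40

reported = 26998899737040          # array length from the traceback
n = samples_from_pairs(reported)

print(n)                           # 7348320
print(pair_count(n) == reported)   # True
print(round(tib(reported), 1))     # 196.4
```

So the allocation grows with the square of the number of samples, not linearly.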
Howdy. The clear solution is to get a computer with 196 TiB of memory. But, if that's not possible, there are three solutions.
The first is to precompute a sparse similarity matrix and feed that into apricot. See: https://github.com/jmschrei/apricot/blob/master/tutorials/3.%20Using%20Sparse%20Inputs.ipynb
The second solution is to use only a limited number of nearest neighbors. Constructing the KNN tree might take some extra time but should scale to massive inputs. You can set this by passing n_neighbors when constructing the object. See: https://apricot-select.readthedocs.io/en/latest/functions/facilityLocation.html
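Both of these options come down to keeping only a few similarities per sample instead of all of them. Here is a minimal sketch of that idea with NumPy, assuming binary features and using "number of matching positions" as the similarity (the complement of Hamming distance). The function name `knn_hamming_similarity` and its parameters are hypothetical and not part of apricot. It searches by brute force in chunks, so it bounds memory but not time; a real run at this scale would want a proper nearest-neighbor index.

```python
import numpy as np

def knn_hamming_similarity(X, k, chunk_size=1024):
    """Keep, for every row of binary matrix X, its k largest similarities.

    Similarity = number of matching positions (n_features - Hamming distance).
    A row's similarity to itself is included. Returns CSR-style arrays
    (data, indices, indptr) with exactly k entries per row.
    """
    X = np.asarray(X, dtype=np.int32)
    n = X.shape[0]
    X_comp = 1 - X                              # flipped bits, reused per chunk
    data = np.empty((n, k), dtype=np.float32)
    indices = np.empty((n, k), dtype=np.int64)

    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Matches = positions where both are 1 + positions where both are 0.
        sim = X[start:stop] @ X.T + X_comp[start:stop] @ X_comp.T
        # Column indices of the k largest entries in each row (unordered).
        top = np.argpartition(-sim, k - 1, axis=1)[:, :k]
        indices[start:stop] = top
        data[start:stop] = np.take_along_axis(sim, top, axis=1)

    indptr = np.arange(0, n * k + 1, k)         # every row holds k entries
    return data.ravel(), indices.ravel(), indptr

# Small usage example on random binary data.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(1000, 32))
data, indices, indptr = knn_hamming_similarity(X_demo, k=10, chunk_size=128)
print(data.shape, indptr.shape)                 # (10000,) (1001,)
```

The three arrays are in the layout that `scipy.sparse.csr_matrix((data, indices, indptr), shape=(n, n))` accepts, which gives a sparse similarity matrix of the kind the linked tutorial works with. Storage is n * k values rather than n * (n - 1) / 2.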
The third is to use streaming submodular optimization. See: https://apricot-select.readthedocs.io/en/latest/features/streaming.html The gist here is that you make one scan over the entire data set, greedily choosing elements whose gain is above a certain threshold. You will not get a subset that is as good as the one from the full greedy algorithm, but it should scale to any number of elements.
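To make the one-scan idea concrete, here is a rough sketch of threshold-based selection for a facility-location-style objective. It is not apricot's implementation: it uses a single fixed threshold, and it scores candidates against a small reference sample, which is an assumption made here so that gains can be computed without seeing the whole dataset. The name `stream_select` and all of its parameters are hypothetical.

```python
import numpy as np

def stream_select(batches, reference, k, threshold):
    """One pass over the data, keeping elements whose gain meets a threshold.

    batches   : iterable of 2-D binary arrays, each read once, in order
    reference : small 2-D binary array used to score candidates (assumption)
    k         : maximum number of elements to keep
    threshold : minimum marginal gain required to keep an element
    Returns the positions (in stream order) of the kept elements.
    """
    reference = np.asarray(reference)
    best = np.zeros(len(reference))   # best similarity so far per reference row
    selected = []
    offset = 0                        # position of the current batch's first row

    for batch in batches:
        batch = np.asarray(batch)
        for j, x in enumerate(batch):
            if len(selected) == k:
                return selected
            # Similarity = number of positions where x matches a reference row.
            sim = (reference == x).sum(axis=1)
            # Facility-location gain: total improvement over the current best.
            gain = np.maximum(sim - best, 0).sum()
            if gain >= threshold:
                selected.append(offset + j)
                best = np.maximum(best, sim)
        offset += len(batch)
    return selected

# Usage: 200 binary samples streamed in 8 batches, scored on 20 reference rows.
rng = np.random.default_rng(0)
X_stream = rng.integers(0, 2, size=(200, 16))
chosen = stream_select(np.array_split(X_stream, 8), X_stream[:20], k=5, threshold=1.0)
print(chosen)
```

Memory use depends on the batch size and the reference sample, not on the total number of elements. The trade-off is the one described above: an element seen early can be kept even if a better one arrives later.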
Thank you for the comprehensive reply!