jmschrei / apricot

apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html

Memory Error for Large Dataset

devinity1337 opened this issue

I'm trying FacilityLocationSelection on a dataset of 7,000,000 samples, using Hamming distance as the metric, and I get a memory error:

numpy.core._exceptions.MemoryError: Unable to allocate 196. TiB for an array with shape (26998899737040,) and data type float64

On my previous dataset of 2,500,000 samples the code worked fine. I assume the issue is computing the distances between every pair of samples. I'm wondering if there are any options for reducing the memory cost. I've also tried faster optimizers and different functions, but I run into the same error.
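For scale, the requested allocation corresponds to one float64 per pair of samples, i.e. n(n-1)/2 entries. A quick back-of-the-envelope check (the exact sample count below is inferred from the array shape in the traceback, not stated above):

```python
# Sanity check on the reported allocation: a condensed pairwise-distance
# array holds n * (n - 1) / 2 float64 values.
n = 7_348_320                      # inferred from the array shape; roughly the 7M samples
n_pairs = n * (n - 1) // 2         # 26998899737040, the shape in the traceback
print(n_pairs * 8 / 2**40)         # 8 bytes per float64 -> ~196 TiB
```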

Howdy. The clear solution is to get a computer with 196 TiB of memory. But if that's not possible, there are three alternatives.

The first is to precompute a sparse similarity matrix and feed that into apricot. See: https://github.com/jmschrei/apricot/blob/master/tutorials/3.%20Using%20Sparse%20Inputs.ipynb
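For concreteness, here is a minimal sketch of that route, assuming apricot accepts a scipy CSR matrix of similarities via metric='precomputed' as in the linked tutorial; the neighborhood size, subset size, and random stand-in data are illustrative, not from this thread:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from apricot import FacilityLocationSelection

X = np.random.randint(0, 2, size=(10_000, 50))    # stand-in for the real binary data

# Build a sparse k-nearest-neighbor graph of Hamming distances, then convert the
# stored distances into similarities (sklearn's hamming distance lies in [0, 1]).
nn = NearestNeighbors(n_neighbors=100, metric='hamming').fit(X)
S = nn.kneighbors_graph(X, mode='distance')       # CSR with ~n * k stored entries
S.data = 1.0 - S.data                             # distance -> similarity
S = S.maximum(S.T).tocsr()                        # symmetrize

# Select 1,000 exemplars from the precomputed sparse similarity matrix.
selector = FacilityLocationSelection(1000, metric='precomputed')
selector.fit(S)
print(selector.ranking[:10])                      # indices of the first selections
```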

The second is to use only a limited number of nearest neighbors. Constructing the KNN graph takes some extra time but should scale to massive inputs. You can set this by passing n_neighbors when constructing the selector. See: https://apricot-select.readthedocs.io/en/latest/functions/facilityLocation.html
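Something along these lines, assuming the n_neighbors argument documented on the page above; the concrete sizes and stand-in data are placeholders:

```python
import numpy as np
from apricot import FacilityLocationSelection

X = np.random.randint(0, 2, size=(10_000, 50))    # stand-in for the real binary data

# With n_neighbors set, only each point's 100 nearest neighbors contribute to
# the similarity structure, so memory grows as n * n_neighbors rather than n**2.
selector = FacilityLocationSelection(1000, metric='hamming', n_neighbors=100)
X_subset = selector.fit_transform(X)
print(X_subset.shape)                             # (1000, 50)
```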

The third is to use streaming submodular optimization: https://apricot-select.readthedocs.io/en/latest/features/streaming.html. The gist is that you make a single pass over the entire data set, greedily keeping elements whose marginal gain is above a certain threshold. The resulting subset will not be as good, but the approach should scale to any number of elements.
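A rough sketch of how that could look, assuming the streaming interface is exposed through the selectors' partial_fit method and that it accepts successive chunks; check the linked page for the authoritative invocation:

```python
import numpy as np
from apricot import FacilityLocationSelection

X = np.random.randint(0, 2, size=(100_000, 50))   # stand-in for the real binary data

# One pass over the data, greedily keeping examples whose marginal gain clears
# a threshold (sieve streaming). The full pairwise matrix is never built.
selector = FacilityLocationSelection(1000, metric='hamming')

for chunk in np.array_split(X, 100):              # feed the data in small batches
    selector.partial_fit(chunk)

print(selector.ranking[:10])                      # indices of the kept examples
```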

Thank you for the comprehensive reply!