jmschrei / apricot

apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html

`partial_fit` and `sieve` can easily outgrow available memory

nucflash opened this issue

Thank you for putting together such a great library. It's been extremely helpful.

I was toying with the parameters in the documentation's example on massive datasets. I realized that when using `partial_fit` (and therefore the `sieve` optimizer) with slightly more features, or with a larger target sample size, it is easy to hit a memory error. Here is an example that I tried:

# apricot-massive-dataset-example.py
from apricot import FeatureBasedSelection
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = fetch_20newsgroups(subset='train', categories=('sci.med', 'sci.space'))
vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(train_data.data) # This returns a sparse matrix which is supported in apricot
print(X_train.shape)

selector = FeatureBasedSelection(1000, concave_func='sqrt', verbose=False)  # target sample size of 1000
selector.partial_fit(X_train)  # partial_fit routes the selection through the sieve optimizer

Running the above, I get:

$ python apricot-massive-dataset-example.py
(1187, 25638)
Traceback (most recent call last):
  File "apricot-example.py", line 12, in <module>
    selector.partial_fit(X_train)
  File "/envs/bla/lib/python3.8/site-packages/apricot/functions/base.py", line 258, in partial_fit
    self.optimizer.select(X, self.n_samples, sample_cost=sample_cost)
  File "/envs/bla/lib/python3.8/site-packages/apricot/optimizers.py", line 1103, in select
    self.function._calculate_sieve_gains(X, thresholds, idxs)
  File "/envs/bla/lib/python3.8/site-packages/apricot/functions/featureBased.py", line 360, in _calculate_sieve_gains
    super(FeatureBasedSelection, self)._calculate_sieve_gains(X,
  File "/envs/bla/lib/python3.8/site-packages/apricot/functions/base.py", line 418, in _calculate_sieve_gains
    self.sieve_subsets_ = numpy.zeros((l, self.n_samples, self._X.shape[1]), dtype='float32')
numpy.core._exceptions.MemoryError: Unable to allocate 117. GiB for an array with shape (1227, 1000, 25638) and data type float32
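
As a quick sanity check on my side, the reported size is consistent with the shape and dtype in the error message:

# back-of-the-envelope for the allocation numpy attempted above
n_thresholds, n_samples, n_features = 1227, 1000, 25638   # shape from the MemoryError
print(n_thresholds * n_samples * n_features * 4 / 2**30)  # float32 = 4 bytes/element -> ~117 GiB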

This behavior doesn't happen when I use `fit()` with another optimizer, e.g., two-stage.
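
For comparison, this is roughly the variant that works for me. If I'm reading the API right, `optimizer='two-stage'` is already the default, so I'm only spelling it out for clarity:

selector = FeatureBasedSelection(1000, concave_func='sqrt', optimizer='two-stage', verbose=False)
selector.fit(X_train)  # batch fit with the two-stage optimizer; this path doesn't hit the sieve allocation above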

Looking into the code, it seems that the root cause is the initialization of the `sieve_subsets_` array, and a similar allocation happens again later on. In both places we ask for a dense, zero-filled float array of size |thresholds| × n_samples × n_features, which can become very large and no longer fit in memory when dealing with massive datasets. I wonder if there is a more memory-efficient way of writing it?
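
To make the question a bit more concrete, this is the kind of thing I have in mind. It is purely a sketch on my end (the name `sieve_subsets` and the sizes are just taken from the traceback above), not a claim about how apricot's internals should look:

from scipy.sparse import lil_matrix

# Hypothetical alternative: one sparse (n_samples x n_features) matrix per threshold
# instead of a single dense 3D array. Rows only take space once a sample is actually
# stored for a threshold, so the empty structure costs a tiny fraction of 117 GiB.
sieve_subsets = [lil_matrix((1000, 25638), dtype='float32') for _ in range(1227)]

Thanks!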