johannfaouzi / pyts

A Python package for time series classification

Home Page: https://pyts.readthedocs.io


Memory Error when performing SSA on a big matrix

hamiddimyati opened this issue

Description

When running SSA on hundreds of time series (a big matrix), my Jupyter notebook keeps restarting without any clue about what is going on. After investigating the code, I found that the problem comes from the X_elem = _outer_dot(v, X_window, n_samples, window_size, n_windows) call in the transform method. More specifically, when I run SSA on 550 time series with 480 timestamps each and a window size of 35% (168 timestamps), this function allocates a very large array (X_new = np.empty((n_samples, window_size, window_size, n_windows))) that crashes my notebook. Maybe it would be better to raise a MemoryError exception in the function.
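For reference, here is a rough back-of-the-envelope estimate of the size of that intermediate array (a minimal sketch; the shape is the one quoted above, float64 and n_windows = n_timestamps - window_size + 1 are assumed):

import numpy as np

# Rough size estimate of the intermediate array allocated by SSA
# (shape quoted above; 8 bytes per float64 value assumed).
n_samples, n_timestamps = 550, 480
window_size = int(0.35 * n_timestamps)      # 168
n_windows = n_timestamps - window_size + 1  # 313, assuming this definition
n_values = n_samples * window_size * window_size * n_windows
print(f"{n_values * 8 / 1e9:.1f} GB")       # about 39 GB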

Steps/Code to Reproduce

import numpy as np
from pyts.decomposition import SingularSpectrumAnalysis

ssa_params = {'window_size': 0.35, 'groups': None}
arr_input = np.random.rand(550, 480)
arr_output = SingularSpectrumAnalysis(**ssa_params).fit_transform(arr_input)

Versions

NumPy 1.20.3
SciPy 1.6.2
Scikit-Learn 0.24.2
Numba 0.54.0
Pyts 0.11.0

Thank you for pointing this out.

The good news is that the decomposition is independent for each time series, so the whole dataset can be processed in batches/chunks. This is likely to take longer, but it requires less memory.
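As a quick illustration of this independence (a minimal sketch, not from the thread): on a small dataset that fits in memory, transforming a single series gives the same result as the corresponding slice of the full transform.

import numpy as np
from pyts.decomposition import SingularSpectrumAnalysis

# Each series is decomposed on its own, so a slice of the dataset gives the
# same result as the corresponding slice of the full transform.
rng = np.random.RandomState(42)
X = rng.rand(10, 100)
ssa = SingularSpectrumAnalysis(window_size=0.35, groups=None)
full = ssa.fit_transform(X)
first = ssa.fit_transform(X[:1])
print(np.allclose(full[:1], first))  # expected: True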

I had a quick look at the scikit-learn source code and found one instance in which they use a try/except block to avoid this issue.
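For illustration, the check could look something like this (a minimal sketch, not the actual scikit-learn or pyts code; check_memory is a hypothetical helper): attempt the allocation up front and fail early with a clear message instead of letting the kernel die.

import numpy as np

def check_memory(n_samples, window_size, n_windows):
    # Hypothetical helper: try to allocate the intermediate array and raise a
    # clear error if it does not fit in memory.
    try:
        return np.empty((n_samples, window_size, window_size, n_windows))
    except MemoryError as exc:
        raise MemoryError(
            "Unable to allocate the intermediate SSA array of shape "
            f"({n_samples}, {window_size}, {window_size}, {n_windows}); "
            "try processing fewer time series at a time."
        ) from exc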

pandas.read_csv has a chunksize parameter to read a CSV file in chunks, which is useful when one wants to apply some processing to a large DataFrame with a lower memory footprint. Instead of loading the whole CSV file as a single DataFrame and processing it all at once, the file is read chunk by chunk and the processing is applied to each chunk in turn. Likewise, this decreases memory usage at the expense of runtime.
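For instance (a minimal sketch with a hypothetical file name large_file.csv):

import pandas as pd

# Read a large CSV file 10 000 rows at a time and process each chunk
# separately, so the whole file never sits in memory at once.
results = []
for chunk in pd.read_csv("large_file.csv", chunksize=10_000):
    results.append(chunk.mean())  # any per-chunk processing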

I think that both solutions could be implemented:

  • The try/except block would just be a check to see whether the whole dataset (or each chunk) can be processed at once.
  • A chunksize parameter could be added to set the number of time series in each chunk: the processing would then be applied to each chunk of time series, one at a time (see the sketch after this list).
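In the meantime, something like the following can serve as a user-level workaround, and it is essentially what a chunksize parameter would do internally (a minimal sketch; the chunk size of 10 is arbitrary and should be picked to fit in memory):

import numpy as np
from pyts.decomposition import SingularSpectrumAnalysis

# Workaround sketch: process the time series 10 at a time and stack the
# results, trading runtime for a much smaller peak memory footprint.
chunk_size = 10
ssa = SingularSpectrumAnalysis(window_size=0.35, groups=None)
arr_input = np.random.rand(550, 480)
arr_output = np.concatenate(
    [ssa.fit_transform(arr_input[i:i + chunk_size])
     for i in range(0, len(arr_input), chunk_size)],
    axis=0,
)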

What do you think?

Awesome! Thanks for providing me with concrete solutions. I will definitely try those options.