sktime / sktime

A unified framework for machine learning with time series

Home Page:https://www.sktime.net

[BUG] `unequal_length` clusterer bugs out in unequal data set

ggjx22 opened this issue · comments

Describe the bug
TimeSeriesKMeansTslearn is a clusterer that, according to its tags, can be fitted on a data set containing time series of unequal length. But it does not seem to do what it is supposed to do. Is this a bug, wrong usage, or a missing preprocessing step?

To Reproduce

from sktime.datasets import load_acsf1
from sktime.registry import all_estimators
from sktime.clustering.k_means._k_means_tslearn import TimeSeriesKMeansTslearn
RANDOM_STATE = 2
no_of_unknown_clusters = 5

# import data
X, _ = load_acsf1(return_type='pd-multiindex')

# remove last 10 rows from the last appliance to simulate unequal time series 
X_mod = X.iloc[:-10]

# instantiate clusterer which can handle unequal time series
# all_estimators('clusterer', as_dataframe=True, filter_tags={'capability:unequal_length': True})
unequal_clst = TimeSeriesKMeansTslearn(n_clusters=no_of_unknown_clusters, n_jobs=-1, random_state=RANDOM_STATE)

# fit the clusterer
unequal_clst.fit(X_mod) # error
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[119], line 18
     15 unequal_clst = TimeSeriesKMeansTslearn(n_clusters=no_of_unknown_clusters, n_jobs=-1, random_state=RANDOM_STATE)
     17 # fit the clusterer
---> 18 unequal_clst.fit(X_mod)

File c:\Users\agpgago\data-science\forecast-wizard-test\venv\lib\site-packages\sktime\clustering\base.py:110, in BaseClusterer.fit(self, X, y)
    107 # reset estimator at the start of fit
    108 self.reset()
--> 110 X = self._check_clusterer_input(X)
    112 multithread = self.get_tag("capability:multithreading")
    113 if multithread:

File c:\Users\agpgago\data-science\forecast-wizard-test\venv\lib\site-packages\sktime\clustering\base.py:422, in BaseClusterer._check_clusterer_input(self, X, enforce_min_instances)
    420 unequal = not X_metadata["is_equal_length"]
    421 self._check_capabilities(missing, multivariate, unequal)
--> 422 return convert_to(
    423     X,
    424     to_type=self.get_tag("X_inner_mtype"),
    425     as_scitype="Panel",
    426 )

File c:\Users\agpgago\data-science\forecast-wizard-test\venv\lib\site-packages\sktime\datatypes\_convert.py:263, in convert_to(obj, to_type, as_scitype, store, store_behaviour, return_to_mtype)
    260 from_type = infer_mtype(obj=obj, as_scitype=as_scitype)
...
    610 X_values = X_coerced.values
--> 611 X_3d = X_values.reshape(n_instances, n_timepoints, n_columns).swapaxes(1, 2)
    613 return X_3d

ValueError: cannot reshape array of size 291990 into shape (200,1460,1)
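The numbers in the error message can be checked by hand: 200 instances of 1460 time points give 292000 values, and dropping 10 rows leaves 291990, which cannot fill a (200, 1460, 1) array. A minimal sketch of the same failure using only numpy, with zeros standing in for the real ACSF1 values:

```python
import numpy as np

n_instances, n_timepoints, n_columns = 200, 1460, 1

# Full panel: every instance has 1460 time points, so the reshape succeeds.
full = np.zeros(n_instances * n_timepoints * n_columns)
full_3d = full.reshape(n_instances, n_timepoints, n_columns)

# Dropping the last 10 values mimics X.iloc[:-10]; the flat array no longer
# fills the target shape, so numpy raises a ValueError.
truncated = full[:-10]
try:
    truncated.reshape(n_instances, n_timepoints, n_columns)
    reshape_failed = False
    message = ""
except ValueError as err:
    reshape_failed = True
    message = str(err)

print(truncated.size, reshape_failed)
```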

For the above to work, I have to bring all time series to the same length before the clusterer will fit. Doesn't that defeat the purpose of having a clusterer that works for unequal-length time series?

from sktime.transformations.panel.padder import PaddingTransformer

clst_pipe = PaddingTransformer() * unequal_clst
clst_pipe.fit(X_mod) # no errors
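Conceptually, padding works around the error by making every series as long as the longest one, so the panel fits into a 3D array again. The sketch below illustrates that idea with plain numpy. The helper `pad_to_equal_length` and its `fill_value` default are hypothetical and are not taken from `PaddingTransformer`, whose actual behaviour and defaults may differ.

```python
import numpy as np


def pad_to_equal_length(series_list, fill_value=0.0):
    """Pad 1D series at the end so all share the length of the longest one.

    Hypothetical helper for illustration only; returns an array of shape
    (n_instances, 1, max_length).
    """
    max_len = max(len(s) for s in series_list)
    padded = np.full((len(series_list), 1, max_len), fill_value, dtype=float)
    for i, s in enumerate(series_list):
        # Copy the observed values; the tail keeps the fill value.
        padded[i, 0, : len(s)] = s
    return padded


series = [np.arange(5.0), np.arange(3.0)]
X_3d = pad_to_equal_length(series)
print(X_3d.shape)
```

Note that the padded values take part in any distance computed afterwards, so padding is a workaround rather than true unequal-length support.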

Expected behavior
Fitted clusterer on multiple time series which are unequal in length

Versions

System:
    python: 3.10.9 | packaged by Anaconda, Inc. | (main, Mar  1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)]
executable: c:\Users\agpgago\data-science\forecast-wizard-test\venv\Scripts\python.exe
   machine: Windows-10-10.0.19042-SP0

Python dependencies:
          pip: 23.3.2
       sktime: 0.28.0
      sklearn: 1.4.1.post1
       skbase: 0.7.2
        numpy: 1.26.4
        scipy: 1.12.0
       pandas: 2.1.4
   matplotlib: 3.8.3
       joblib: 1.3.2
        numba: 0.58.1
  statsmodels: 0.14.1
     pmdarima: 2.0.4
statsforecast: 1.7.3
      tsfresh: 0.20.2
      tslearn: 0.6.3
        torch: None
   tensorflow: 2.15.0
tensorflow_probability: None

Yes, I think the tag is simply wrong, thanks for reporting.

I think the tag is wrong because this estimator interfaces tslearn, and tslearn uses vanilla 3D numpy internally, which implies equal length.

Explanation of why the tests did not catch this: there were no unequal-length test scenarios for clustering, unlike for classification. I'll fix that too.
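As a rough illustration of what an unequal-length scenario looks like, the sketch below builds a tiny two-level `pandas` panel and checks whether all instances have the same number of time points. The helper `is_equal_length` and the toy data are hypothetical and are not part of the sktime test suite.

```python
import pandas as pd


def is_equal_length(X):
    """Return True if every instance (index level 0) has the same row count."""
    lengths = X.groupby(level=0).size()
    return bool(lengths.nunique() == 1)


# Two instances with three time points each.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)],
    names=["instance", "time"],
)
X_equal = pd.DataFrame({"dim_0": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)

# Dropping the last row shortens the last instance, as in the report above.
X_unequal = X_equal.iloc[:-1]

print(is_equal_length(X_equal), is_equal_length(X_unequal))
```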