[BUG] `unequal_length` clusterer bugs out in unequal data set
ggjx22 opened this issue · comments
Describe the bug
`TimeSeriesKMeansTslearn` is a clusterer that, according to its tags, can be fitted on a data set containing time series of unequal length. But it does not seem to do what it is supposed to do. Possible bug, wrong usage, or missing preprocessing step?
To Reproduce
from sktime.datasets import load_acsf1
from sktime.registry import all_estimators
from sktime.clustering.k_means._k_means_tslearn import TimeSeriesKMeansTslearn
RANDOM_STATE = 2
no_of_unknown_clusters = 5
# import data
X, _ = load_acsf1(return_type='pd-multiindex')
# remove last 10 rows from the last appliance to simulate unequal time series
X_mod = X.iloc[:-10]
# instantiate clusterer which can handle unequal time series
# all_estimators('clusterer', as_dataframe=True, filter_tags={'capability:unequal_length': True})
unequal_clst = TimeSeriesKMeansTslearn(n_clusters=no_of_unknown_clusters, n_jobs=-1, random_state=RANDOM_STATE)
# fit the clusterer
unequal_clst.fit(X_mod) # error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[119], line 18
15 unequal_clst = TimeSeriesKMeansTslearn(n_clusters=no_of_unknown_clusters, n_jobs=-1, random_state=RANDOM_STATE)
17 # fit the clusterer
---> 18 unequal_clst.fit(X_mod)
File c:\Users\agpgago\data-science\forecast-wizard-test\venv\lib\site-packages\sktime\clustering\base.py:110, in BaseClusterer.fit(self, X, y)
107 # reset estimator at the start of fit
108 self.reset()
--> 110 X = self._check_clusterer_input(X)
112 multithread = self.get_tag("capability:multithreading")
113 if multithread:
File c:\Users\agpgago\data-science\forecast-wizard-test\venv\lib\site-packages\sktime\clustering\base.py:422, in BaseClusterer._check_clusterer_input(self, X, enforce_min_instances)
420 unequal = not X_metadata["is_equal_length"]
421 self._check_capabilities(missing, multivariate, unequal)
--> 422 return convert_to(
423 X,
424 to_type=self.get_tag("X_inner_mtype"),
425 as_scitype="Panel",
426 )
File c:\Users\agpgago\data-science\forecast-wizard-test\venv\lib\site-packages\sktime\datatypes\_convert.py:263, in convert_to(obj, to_type, as_scitype, store, store_behaviour, return_to_mtype)
260 from_type = infer_mtype(obj=obj, as_scitype=as_scitype)
...
610 X_values = X_coerced.values
--> 611 X_3d = X_values.reshape(n_instances, n_timepoints, n_columns).swapaxes(1, 2)
613 return X_3d
ValueError: cannot reshape array of size 291990 into shape (200,1460,1)
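The numbers in the error message line up with the 10 removed rows: the target shape (200, 1460, 1) needs 292000 values, but only 291990 are left. A minimal numpy sketch of that arithmetic, assuming (as the target shape suggests) 200 instances of 1460 time points and one column; the zero-filled array only stands in for the real data:

```python
import numpy as np

# Target shape taken from the error message above.
n_instances, n_timepoints, n_columns = 200, 1460, 1
full_size = n_instances * n_timepoints * n_columns  # 292000

# Stand-in for X_mod's values: 10 rows short of a full rectangular panel.
values = np.zeros(full_size - 10)

try:
    values.reshape(n_instances, n_timepoints, n_columns)
    reshape_error = None
except ValueError as err:
    # numpy refuses because 291990 != 200 * 1460 * 1
    reshape_error = str(err)

print(values.size, full_size, reshape_error)
```

So the conversion to a 3D array assumes every instance has the same number of time points, and fails as soon as one instance is shorter.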
For the fit to succeed, I actually have to bring all time series to the same length first, as in the pipeline below. Doesn't that defeat the purpose of having a clusterer that works for unequal length time series?
from sktime.transformations.panel.padder import PaddingTransformer
clst_pipe = PaddingTransformer() * unequal_clst
clst_pipe.fit(X_mod) # no errors
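For illustration, the idea behind padding can be sketched in plain numpy: every series is extended at the end until it matches the longest one, so the result fits a rectangular array. This is only a sketch of the concept, not `PaddingTransformer`'s actual implementation; the helper name `pad_to_longest` and the fill value of 0 are assumptions made for the example.

```python
import numpy as np


def pad_to_longest(series_list, fill_value=0.0):
    """Pad 1D series at the end so all share the length of the longest one.

    Hypothetical helper for illustration only.
    """
    max_len = max(len(s) for s in series_list)
    padded = np.full((len(series_list), max_len), fill_value, dtype=float)
    for i, s in enumerate(series_list):
        # Copy the observed values; the tail keeps the fill value.
        padded[i, : len(s)] = s
    return padded


# Two series of unequal length (5 and 3 time points).
series = [np.arange(5.0), np.arange(3.0)]
X_padded = pad_to_longest(series)
print(X_padded.shape)
print(X_padded)
```

The padded values are artificial, which is why padding changes the data the clusterer sees rather than making it handle unequal lengths natively.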
Expected behavior
Fitted clusterer on multiple time series which are unequal in length
Versions
System:
python: 3.10.9 | packaged by Anaconda, Inc. | (main, Mar 1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)]
executable: c:\Users\agpgago\data-science\forecast-wizard-test\venv\Scripts\python.exe
machine: Windows-10-10.0.19042-SP0
Python dependencies:
pip: 23.3.2
sktime: 0.28.0
sklearn: 1.4.1.post1
skbase: 0.7.2
numpy: 1.26.4
scipy: 1.12.0
pandas: 2.1.4
matplotlib: 3.8.3
joblib: 1.3.2
numba: 0.58.1
statsmodels: 0.14.1
pmdarima: 2.0.4
statsforecast: 1.7.3
tsfresh: 0.20.2
tslearn: 0.6.3
torch: None
tensorflow: 2.15.0
tensorflow_probability: None
Yes, I think the tag is simply wrong, thanks for reporting.
I think the tag is wrong because this estimator interfaces `tslearn`, and `tslearn` uses vanilla 3D `numpy` internally, which implies equal length.
Explanation of why the tests did not catch this: there were no unequal length test scenarios for clustering, unlike for classification. I'll fix that too.
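As a rough illustration of what an unequal length scenario involves, the sketch below builds a small panel in a two-level `pandas` MultiIndex layout where the instances differ in length, and checks that property. The helper names `make_unequal_panel` and `is_equal_length`, the index level names, and the column name are all hypothetical; this is not sktime's test fixture code.

```python
import numpy as np
import pandas as pd


def make_unequal_panel(lengths=(5, 3), seed=0):
    """Build a (instance, time)-indexed panel with one column.

    Hypothetical helper; each entry of `lengths` is one instance's length.
    """
    rng = np.random.default_rng(seed)
    frames = []
    for i, n in enumerate(lengths):
        idx = pd.MultiIndex.from_product(
            [[i], range(n)], names=["instance", "time"]
        )
        frames.append(pd.DataFrame({"var_0": rng.normal(size=n)}, index=idx))
    return pd.concat(frames)


def is_equal_length(panel):
    """Return True if every instance has the same number of time points."""
    return panel.groupby(level=0).size().nunique() == 1


panel = make_unequal_panel()
print(panel.groupby(level=0).size())
print(is_equal_length(panel))
```

A test built on such a panel would call `fit` on every clusterer that claims the unequal length capability and expect it not to raise.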