sktime / sktime

A unified framework for machine learning with time series

Home Page:https://www.sktime.net

[ENH] Using custom distance measure with `TimeSeriesKMeans` (`distance_matrix_fast` of `dtaidistance`)

nils-fl opened this issue · comments

commented

Is your feature request related to a problem? Please describe.
It's not a real problem; I would like to speed up the DTW pairwise distance computation so that I can work more comfortably with larger datasets.
See discussion here: #6301

Describe the solution you'd like
Different solution paths were pointed out in the discussion (see link above).
TimeSeriesKMeans uses pairwise_distance, which should produce output equivalent to distance_matrix/distance_matrix_fast of dtaidistance?
Looking at the code, I feel a bit overwhelmed though ...

Additional context
I compared the runtime of pairwise_distance with distance_matrix_fast using a dataset of 20_000 arrays of length 40.

  • pairwise_distance ~2 min
  • distance_matrix_fast ~5 sec
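
For reference, both functions return a symmetric matrix of pairwise DTW distances. A minimal pure-NumPy sketch of that computation is below. `dtw_distance` and `dtw_distance_matrix` are illustrative names, not the API of sktime or dtaidistance, and the cost convention (squared pointwise differences, square root at the end) is an assumption. Libraries may differ on such conventions, so it is worth comparing outputs on a small sample before swapping one for the other.

```python
import numpy as np


def dtw_distance(a, b):
    """DTW distance between two 1-D series (assumed convention:
    squared pointwise cost, square root of the accumulated cost)."""
    n, m = len(a), len(b)
    # cost[i, j] = cheapest accumulated cost aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))


def dtw_distance_matrix(series):
    """Symmetric (n, n) matrix of pairwise DTW distances."""
    n = len(series)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # DTW is symmetric, so each pair is computed once
            dist[i, j] = dist[j, i] = dtw_distance(series[i], series[j])
    return dist
```

This double loop is far too slow for 20_000 series; it is only meant to pin down what is being computed, not to compete with either library.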

One note: looking into the code, this will probably work only for distances between equal-length time series.

That is because the initialization and averaging subroutines assume equal length. Without a substantial extension of the internal logic, it is not well-defined what to do when unequal-length time series are encountered.

commented

You are right - it won't be easy to implement in a nice way. Actually, at the moment, I doubt that it's worth the effort ...

I could check; perhaps there's a way to extend it for equal-length distances.

Is your use case equal length, or unequal length?

commented

Sorry for my late responses!
I am using equal length.
Thank you for your willingness to review the code in more detail.

commented

Would a simple workaround be to pass the distance matrix directly to the BaseEstimator?
And in general, would it make sense to have a distance_matrix parameter in TimeSeriesKMeans which - if not None - would skip the distance computations and go directly to KMeans?

It would unlock using all kinds of future libraries to compute distances, and it would be up to the user to add weighting or whatever else in a custom way, without sktime having to include those libraries.

> And in general, would it make sense to have a distance_matrix parameter in TimeSeriesKMeans which - if not None - would skip the distance computations and go directly to KMeans?

Hm, it would, but you can already do that by going directly to KMeans in sklearn. I.e., you can compute your distance matrix and just pass it to KMeans.fit, from sklearn. What the sktime estimator adds is adherence to the time series interface - which you do not need if you pre-compute distances.

Could you perhaps explain where you see the difference or benefit compared to just calling KMeans?
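
As an illustration of clustering from a pre-computed matrix alone, here is a minimal NumPy sketch. Note that it uses k-medoids rather than k-means: this is a swapped-in technique, chosen because medoids need only pairwise distances, whereas k-means centroids are averages of the raw data. The function name `k_medoids_precomputed` and its parameters are hypothetical and not part of sktime, sklearn, or dtaidistance.

```python
import numpy as np


def k_medoids_precomputed(dist, n_clusters, n_iter=50, seed=0):
    """Cluster from an (n, n) distance matrix alone.

    Returns (labels, medoid_indices). Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=n_clusters, replace=False)
    for _ in range(n_iter):
        # assignment step: nearest medoid for every point
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(n_clusters):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # update step: member with the smallest total distance to the others
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return labels, medoids
```

The `dist` argument could come from any library that returns a full square matrix; the sketch never looks at the series themselves.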

commented

You are right. From my point of view it doesn't really make sense anymore to implement the dtaidistance path as it's too much work for something that can easily be done via "distance matrix" + "any kMeans library".

Should I close this issue or are you interested in further investigations?

I think we shouldn't close - distance matrix + kmeans is memory intensive and will cause problems for large data sets, since the entire distance matrix needs to be stored in memory. It will either fail completely by running out of memory or be very slow.

This approach has previously caused issues for users with large data sets, for knn: #5937

The same issue would arise with kmeans, and we should find a way to pass custom distances to be used internally, iteratively, without storing the entire distance matrix in memory.
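
A rough sketch of what "internally, iteratively" could look like: a Lloyd-style loop that only ever computes distances from a chunk of series to the current k centers, so memory stays at chunk_size × k instead of n × n. Everything here is an assumption for illustration. The names `kmeans_custom_distance`, `assign_in_chunks`, and `euclidean_cross` are hypothetical, the custom distance is assumed to be a callable returning a (len(a), len(b)) array, and the centers are plain arithmetic means, which requires equal-length series and is not a DTW barycenter.

```python
import numpy as np


def euclidean_cross(a, b):
    """Stand-in custom distance: (len(a), len(b)) Euclidean distances between rows."""
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))


def assign_in_chunks(X, centers, distance, chunk_size=1024):
    """Nearest-center label per series, holding only a (chunk, k) block in memory."""
    labels = np.empty(len(X), dtype=int)
    for start in range(0, len(X), chunk_size):
        block = X[start:start + chunk_size]
        labels[start:start + chunk_size] = np.argmin(distance(block, centers), axis=1)
    return labels


def kmeans_custom_distance(X, n_clusters, distance, n_iter=20, chunk_size=1024, seed=0):
    """Lloyd-style clustering with a pluggable distance and mean-based centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign_in_chunks(X, centers, distance, chunk_size)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members):  # leave the center unchanged if its cluster is empty
                centers[c] = members.mean(axis=0)
    return assign_in_chunks(X, centers, distance, chunk_size), centers
```

Any function with the same shape contract could be passed as `distance`, which is where a faster external DTW implementation would plug in.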

Therefore, I would say we keep this open with the scope as originally intended?

commented

Oh ... I totally underestimated the size of the distance matrix for large datasets!
Just ran a test, and there is no chance of keeping it in memory when approaching ~100,000 series of length 100, which is not even super large.

Well, that's squared scaling for you... let me have a look after the release how easy it would be to modify the existing estimator.
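
To put numbers on the squared scaling, here is a back-of-the-envelope calculation. It assumes 8-byte float64 entries; `distance_matrix_gib` is just an illustrative helper. The series length does not enter at all, only the number of series.

```python
def distance_matrix_gib(n_series, bytes_per_value=8, condensed=False):
    """Memory needed for a pairwise distance matrix, in GiB.

    condensed=True counts only the upper triangle without the diagonal.
    """
    if condensed:
        n_values = n_series * (n_series - 1) // 2
    else:
        n_values = n_series ** 2
    return n_values * bytes_per_value / 2 ** 30


for n in (20_000, 100_000):
    full = distance_matrix_gib(n)
    half = distance_matrix_gib(n, condensed=True)
    print(f"{n:>7} series: full {full:.1f} GiB, condensed {half:.1f} GiB")
```

For 20_000 series the full matrix is about 3 GiB, while for 100,000 series it is roughly 75 GiB, and storing only the upper triangle merely halves that.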