sktime / sktime

A unified framework for machine learning with time series

Home Page: https://www.sktime.net


[ENH] Composition forecaster that clusters time series and forecasts by cluster

ggjx22 opened this issue · comments

Is your feature request related to a problem? Please describe.
Currently, there is no feature that estimates (provides) indicators based on the past behavior or trends of a time series dataset. Visual inspection is required before deciding how best to choose the appropriate preprocessing and model to produce a forecast, be it direct or recursive.

Describe the solution you'd like
To have a feature which estimates the forecastability of a time series by labeling it as smooth, erratic, lumpy, intermediate, or unclassified (or just integer-based labels, the same as what the clusterers of sktime currently provide via get_fitted_params). There are existing studies that show this can be implemented for demand forecasting. However, this may lead to misclassification if the data contains both positive and negative values (see the calculation in the links below). Maybe passing the data through a MinMaxScaler solves the issue, any thoughts?

Another website discussing measurement for forecastability:

Finally, based on the result this feature provides, the user decides whether and how to preprocess, aggregate multiple time series to increase forecastability, model the data, etc.
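For reference, the common demand-forecasting categorization (Syntetos-Boylan style) derives the label from two summaries: the average demand interval (ADI) and the squared coefficient of variation (CV²) of the non-zero demand sizes, with the usual 1.32 / 0.49 cutoffs. A minimal numpy sketch (function names are my own; the standard category names use "intermittent" where the text above says "intermediate"):

```python
import numpy as np

def adi_cv2(y):
    """ADI and CV^2 of a demand series (Syntetos-Boylan style).

    ADI  = average inter-demand interval, i.e., periods per non-zero demand
    CV^2 = squared coefficient of variation of the non-zero demand sizes
    """
    y = np.asarray(y, dtype=float)
    nonzero = y[y != 0]
    adi = len(y) / len(nonzero)
    cv2 = (nonzero.std() / nonzero.mean()) ** 2
    return adi, cv2

def categorize(y, adi_cut=1.32, cv2_cut=0.49):
    """Label a series as smooth / erratic / intermittent / lumpy."""
    adi, cv2 = adi_cv2(y)
    if adi < adi_cut:
        return "smooth" if cv2 < cv2_cut else "erratic"
    return "intermittent" if cv2 < cv2_cut else "lumpy"
```

Note that `nonzero.mean()` is exactly where mixed-sign data bites: balanced positives and negatives shrink the mean toward zero and inflate CV², which is the misclassification concern raised above.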

I think it is a very interesting question how one would support this at a framework level, as it links clustering and forecasting!

Let's consider the following algorithm:

  1. first, apply a clusterer to a panel of time series, sorting them into categories like lumpy, erratic, chewie, etc.
  2. then, apply a different forecaster per category
  3. possibly, on top of the above, tune which forecaster is applied to which category, or the parameters of the clusterer

To deal with this programmatically, I think we need:

  • a compositor that takes a "grouping" from a clusterer or primitives transformer and applies a forecaster by class/category, similar to HierarchyEnsembleForecaster
  • possibly, polymorphism for clusterers that are capable of producing labels, to behave as to-primitives transformers
  • the concrete estimator that @ggjx22 is suggesting, to categorize time series into different categories

The compositor, and the concrete estimator might be nicely defined contribution projects.
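To make the compositor idea concrete, here is a rough plain-Python sketch (hypothetical, not real sktime API; all names are placeholders) of what fit and predict could do, treating a panel as a dict of series:

```python
import copy

class ForecastByGroup:
    """Rough sketch of a group-wise forecasting compositor.

    grouper     : object with fit_predict(panel) -> {series_id: label}
    forecasters : dict mapping label -> forecaster with fit(y) / predict(fh)
    """

    def __init__(self, grouper, forecasters):
        self.grouper = grouper
        self.forecasters = forecasters

    def fit(self, panel):
        # step 1: cluster the panel into categories
        self.labels_ = self.grouper.fit_predict(panel)
        # step 2: fit a fresh copy of the matching forecaster per series
        self.forecasters_ = {}
        for sid, y in panel.items():
            f = copy.deepcopy(self.forecasters[self.labels_[sid]])
            self.forecasters_[sid] = f.fit(y)
        return self

    def predict(self, fh):
        # step 3: forecast each series with its category's fitted forecaster
        return {sid: f.predict(fh) for sid, f in self.forecasters_.items()}
```

A real sktime version would clone estimators and follow the `BaseForecaster` extension template instead, but the fit/predict split above is the essential logic.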

@ggjx22, let me know if I understood this right!

Code might look as follows

# all estimator names below are hypothetical placeholders
ts_clusterer = TsCategorizer()

forecast_by_type = ForecastByGroup(
    grouper=ts_clusterer,
    forecasters={
        "lumpy": LumpyForecaster(42),
        "erratic": ErraticForecaster("foo"),
        "chewie": ChewieForecaster("bar"),
    },
)

and we could tune this with ForecastingGridSearchCV.

@fkiraly If you are suggesting an overall pipeline, then it looks about right. What I hope the feature will give me are labels or scores that indicate forecastability, which in the case of the code above come from TsCategorizer.

For example, the air passenger dataset has high forecastability, so through some calculation a corresponding label or high score is given, whereas a different label or a low score is given for a dataset that does not have any historical seasonal/trend patterns that a model can extrapolate.
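One common way to turn this intuition into a score is spectral entropy: a series with strong seasonal or trend structure concentrates its power spectrum in a few frequencies, while an unpredictable series spreads it out. A hedged numpy sketch (not an existing sktime estimator; the function name is my own):

```python
import numpy as np

def forecastability_score(y):
    """1 - normalized spectral entropy: near 1 for strongly periodic
    series, near 0 for white-noise-like series (rough sketch)."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    psd = np.abs(np.fft.rfft(y)) ** 2
    psd = psd[1:]                 # drop the zero-frequency bin
    p = psd / psd.sum()           # normalize to a probability distribution
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum() / np.log(len(psd))
    return 1.0 - entropy
```

On a clean sinusoid this scores close to 1, and on white noise close to 0, which matches the "high forecastability vs. no extrapolatable pattern" distinction above.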

do you have any references or proposed estimators that could be implemented here? The design allows different clusterers or transformations to be passed as grouper.

Unfortunately, no, not at the moment.

Oh this sounds very interesting! I've hacked my way around this same problem in the past (with ADI/CV2 for demand forecasting), albeit on a more ad hoc basis. Love the idea of supporting the workflow at a framework level. Maybe I could help out? Just subscribed to the issue. I think this would be a very nice addition to sktime, as it also showcases the power of a complete framework for time series: being able to string together clustering with forecasting is much more valuable than a library with just forecasting algorithms.

Nice, @marrov!

Would you like to give it a try?
I would start by inheriting from BaseForecaster, ignoring _HeterogenousMetaEstimator for the start, and going by the extension template.

I'd make the arguments a clusterer and a dict of forecasters.

The main alternative I see is starting with a _HeterogenousMetaEstimator instead, and having a list of pairs instead of a dict. Perhaps that's better because it is closer to existing patterns?

Perhaps HierarchyEnsembleForecaster is the closest to that.

FYI @VyomkeshVyas, because of HierarchyEnsembleForecaster

Hello @marrov or anyone with the relevant experience. I would like to seek your advice since you have worked with ADI/CV2 before. In cases where there could be negative demand, what would you do when calculating the CV? The positive and negative values cancel each other out. If you take the absolute values of the time series, will it lead to misclassification?

For my case, I am trying to borrow this demand-forecasting concept as a quick and dirty approach to analyzing time series in the financial space: accounts with positive and negative values represent credit/debit. Clustering as an analysis approach would make more sense here, but I am currently running into unequal-length issues.
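To illustrate the cancellation concern numerically: with roughly balanced credits and debits the mean lands near zero, so the CV denominator collapses and the ratio explodes, while the absolute values give a modest CV (toy numbers of my own, not real account data):

```python
import numpy as np

# credits/debits, roughly balanced around zero
y = np.array([100.0, -90.0, 110.0, -105.0, 95.0, -100.0])

mean_signed = y.mean()                        # close to zero
cv_signed = y.std() / mean_signed             # blows up: tiny denominator
cv_abs = np.abs(y).std() / np.abs(y).mean()   # modest: sizes are similar

print(cv_signed, cv_abs)
```

So the same series looks wildly "erratic" on signed values and almost perfectly "smooth" on absolute values, which is exactly the misclassification risk.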

@ggjx22, may I ask what "negative demand" means in your case? If we talk about sales, the numbers are by definition non-negative, except if you take product returns or refunds into account (where a return ends up in a different time bin than the sale), but that's more of an artefact than "genuine" negatives, because customers cannot return more products than they bought.

@fkiraly For my case, think of "negative demand" as transactional values being debited (incoming funds) or credited (outgoing funds) depending on the nature of the account. See sample plot attached.

[image: sample plot of debited/credited transaction values]

I see. ADI/CV are summaries for event or count processes, and your data seems to be a marked event process, i.e., "transaction happens" plus "value of transaction".

I can think of a few options off the top of my head:

  • apply adi/cv to transaction count, that's an integer
  • apply to absolute value
  • use a measure of skew/kurtosis appropriate for distributions with mean close to zero, e.g., sample skew or kurtosis, or normalized variants
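The three options above could be sketched as plain numpy summaries (names are my own; the skew/kurtosis variant uses standardized moments, which stay well-behaved when the mean is near zero):

```python
import numpy as np

def event_summaries(y):
    """Signed-data-friendly summaries for a marked event series."""
    y = np.asarray(y, dtype=float)
    events = y != 0
    # option 1: ADI on the event indicator (counts are non-negative)
    adi = len(y) / events.sum()
    # option 2: CV^2 on absolute transaction values
    sizes = np.abs(y[events])
    cv2_abs = (sizes.std() / sizes.mean()) ** 2
    # option 3: standardized moments tolerate a mean close to zero
    z = (y - y.mean()) / y.std()
    skew = (z ** 3).mean()
    kurt = (z ** 4).mean() - 3.0  # excess kurtosis
    return {"adi": adi, "cv2_abs": cv2_abs, "skew": skew, "kurtosis": kurt}
```

All three stay informative on credit/debit data where the raw signed CV would collapse.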

One question - would you like to contribute an adi/cv transformation to sktime? I don't think there's one available except through composition or generic pandas?

Also, out of interest: 2020 spike is probably covid, but what is 2016? US elections?

added issue with specification of a transformation: #6286

One question - would you like to contribute an adi/cv transformation to sktime? I don't think there's one available except through composition or generic pandas?

I can try but I have no experience with open source contributions. Also it may take some time for me to warm up due to work commitments.

Also, out of interest: 2020 spike is probably covid, but what is 2016? US elections?

Hmm, not sure about that, I wasn't with the company back then. I could ask, but it would just be one of many unique one-off occurrences. It could also be data quality issues.

Aside from that, if an outlier exists during any of the cross-validated backtesting folds and is treated (say, dropped and interpolated), will that introduce some form of bias into model training? Assume you do not know whether those spikes will happen again in the next x months (the forecast horizon). I find this topic very subjective when discussing it with people from the business; they think the model training procedure is based on "fake" actual data.

I wasn't with the company back then. I could ask, but it would just be one of many unique one-off occurrences

No worries, it's not necessary for any technical discussion. I was just curious.

Aside from that, if an outlier exists during any of the cross-validated backtesting folds and is treated (say, dropped and interpolated), will that introduce some form of bias into model training?

Many time series models assume stationarity or a similar type of "continuous regime" to function.

There is no way any model can deal with "black swan" events, i.e., unique, consequential changes to the system state that have not occurred before in similar form, except by flagging that they are occurring.

Imo the safest is to have a component that detects departures from previously observed data ranges, and to use that as a recommender for how much the model can be trusted (or not).

When you say a "component" that detects, do you mean something like an exogenous feature? Something like days before or days after those occurrences (genuine outliers)?

No, I mean an algorithm subroutine. That is, you have an algorithm that forecasts, but it also has a warning routine that tells you when it believes forecasts are getting unreliable.
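Such a warning subroutine could be as simple as tracking the training range and flagging departures. A hypothetical sketch (class and method names are my own, not a sktime component):

```python
import numpy as np

class RangeGuard:
    """Warning subroutine: flags observations that depart from the
    value range seen during training (hypothetical sketch)."""

    def fit(self, y_train, pad=0.1):
        # remember the training range, widened by a small padding margin
        y = np.asarray(y_train, dtype=float)
        span = y.max() - y.min()
        self.lo_ = y.min() - pad * span
        self.hi_ = y.max() + pad * span
        return self

    def warn(self, y_new):
        """Boolean mask: True where forecasts may no longer be trusted."""
        y = np.asarray(y_new, dtype=float)
        return (y < self.lo_) | (y > self.hi_)
```

A forecaster could run this alongside prediction and surface the flags to the user, rather than pretending the model can extrapolate through a black-swan regime.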

@fkiraly, can I work on this issue? Thank you! I mean in reference to the 3-step plan you discussed. Having implemented Step 1, I'd love to work on Step 2!

Yes, absolutely! All yours!

@ggjx22, the ADI-CV feature is implemented in #6336 and will be in the next release.