johannfaouzi / pyts

A Python package for time series classification

Home Page: https://pyts.readthedocs.io


WEASEL+MUSE with Samples of Different Lengths

lambdatascience opened this issue

If the samples don't all have the same length n_timestamps, is it possible to perform WEASEL+MUSE on that dataset?

If I construct an array of shape (n_samples, n_features, n_timestamps) with unequal n_timestamps, I get an exception from the validation of the array.

If I pad the samples with None or if I pad with some constant value, I get an exception (NaNs not allowed / quantiles equal).

Is there a solution/workaround to this problem, or am I chasing something that isn't allowed?

Thank you for a great package!

from pyts.multivariate.transformation import WEASELMUSE

# Create a fake simple dataset:
# 3 samples, 2 features, different n_timestamps per sample
X = [[[3, 1, 0, 2, 2], [2, 2, 3, 3, 5]],
     [[0, 1, 0, 5, 2, 4], [3, 0, 0, 2, 4, 2]],
     [[1, 0, 1, 3, 5], [1, 3, 1, 2, 3]]]
y = [1, 0, 1]

transformer = WEASELMUSE(word_size=4, n_bins=2, window_sizes=[8],
                         chi2_threshold=15, sparse=False)
# Raises an exception: the ragged X cannot be validated as a 3D numpy
# array of shape (n_samples, n_features, n_timestamps)
X_new = transformer.fit_transform(X, y)

Hi,

Thank you for your interest in pyts. Data sets of variable-length time series are unfortunately poorly supported for the moment, for the following reasons:

  • For a very long time, the data sets commonly used to benchmark the algorithms (UEA & UCR Time Series Classification Repository) only contained fixed-length time series, so most algorithms were developed for fixed-length time series data sets.
  • It's easier and more efficient to work on fixed-length time series data sets with n-dimensional numpy arrays.

That being said, I think that there are still ways of dealing with variable-length time series data sets. Some dummy ways would be to truncate the time series, or to pad them with a real number, to make them fixed-length; see the sketch below.
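
For instance, here is a minimal sketch of the padding workaround with numpy (the data and the choice of padding with the last observed value are only illustrative):

import numpy as np

# Minimal sketch: pad the variable-length series (the ragged X from
# above) on the right so that every sample has the same length.
X = [[[3, 1, 0, 2, 2], [2, 2, 3, 3, 5]],
     [[0, 1, 0, 5, 2, 4], [3, 0, 0, 2, 4, 2]],
     [[1, 0, 1, 3, 5], [1, 3, 1, 2, 3]]]

target_length = max(len(ts) for sample in X for ts in sample)

# Repeating the last value ('edge') avoids adding a constant zero tail
X_padded = np.asarray(
    [[np.pad(ts, (0, target_length - len(ts)), mode='edge') for ts in sample]
     for sample in X])
print(X_padded.shape)  # (3, 2, 6)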

WEASEL+MUSE is basically WEASEL applied independently to each feature and to the derivatives of the time series. So, if the lengths of the time series are identical for a given feature, then you're good to go.
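
For instance, a rough sketch of that idea using pyts.transformation.WEASEL on each feature separately (the data is synthetic and the parameters are only illustrative; the real WEASEL+MUSE also transforms the derivatives):

import numpy as np
from pyts.transformation import WEASEL

# Sketch: one WEASEL transformer per feature, word counts concatenated.
# Lengths may differ across features, but not within a feature.
rng = np.random.RandomState(42)
X_feature_1 = rng.randn(20, 60)   # all samples of feature 1: length 60
X_feature_2 = rng.randn(20, 150)  # all samples of feature 2: length 150
y = rng.randint(2, size=20)

X_words = np.hstack([
    WEASEL(word_size=2, n_bins=2, window_sizes=[12, 24],
           sparse=False).fit_transform(X_f, y)
    for X_f in (X_feature_1, X_feature_2)])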

I may be able to provide a better answer and a code example if I know why the time series have different lengths, as well as the approximate size of your data set.

Hope this helps you a bit.

Thanks for getting back to me.

It's definitely understandable why the algorithms are the way they are. The problems have always asked for it that way.

The problem I'm pursuing is:

Many features are measured through time, and the system goes into and out of a target state, so it's only a binary classification (True or False). Each time the features go into the state (True), that span of time is obviously a different length than the preceding out-of-state span (False). I had chopped up each time series into samples of True and False.

For example:

Features A and B, measured every minute for a day. Let's say it goes into state from 07:00 to 10:00 and 14:00 to 15:00.
This would be 5 samples:
X = [[Sample 1, 2 features, 420 timestamps from 00:00 to 07:00],
     [Sample 2, 2 features, 180 timestamps from 07:00 to 10:00],
     [Sample 3, 2 features, 240 timestamps from 10:00 to 14:00],
     [Sample 4, 2 features, 60 timestamps from 14:00 to 15:00],
     [Sample 5, 2 features, 540 timestamps from 15:00 to 00:00]]
y = [0, 1, 0, 1, 0]

I want to transform this data and then build a classifier of True/False states.

I have tried padding the time series with 0's to make them all the same length, but an exception happens:

~/lib/python3.6/site-packages/pyts/approximation/mcb.py in _compute_bins(self, X, y, n_timestamps, n_bins, strategy)
    209             if np.any(np.diff(bins_edges, axis=0) == 0):
    210                 raise ValueError(
--> 211                     "At least two consecutive quantiles are equal. "
    212                     "Consider trying with a smaller number of bins or "
    213                     "removing timestamps with low variation."

ValueError: At least two consecutive quantiles are equal. Consider trying with a smaller number of bins or removing timestamps with low variation.

I hope that makes sense. The scale of the data is roughly ~600 samples and 50 features, with lengths ranging from 60 to 150 timestamps.

Thank you for your detailed answer. Concerning the error, I should get rid of it because it is too restrictive. Basically this error occurs because the Fourier coefficients are all equal to 0, and binning a constant variable with more than 1 bin is a bit weird.
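
For the record, a tiny sketch of why the check fires: the quantile bin edges of a constant variable are all identical, so consecutive edges are equal:

import numpy as np

# Quantile bin edges of a constant variable (e.g. Fourier coefficients
# that are all 0 because of the padding) are all equal.
coefs = np.zeros(30)
edges = np.percentile(coefs, [0, 50, 100])  # 2 bins -> 3 edges
print(edges)                # [0. 0. 0.]
print(np.diff(edges) == 0)  # [ True  True] -> the ValueError is raised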

Maybe your example is a bit too simple, but if the features can go into the target state only every hour, and if you assume that going in and out of the target state only depends on the previous hour, you would end up with one sample per hour, and all the time series would have the same length.
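
In code, that reformulation could look like the following sketch (the measurements are synthetic placeholders):

import numpy as np

# Sketch: slice a day of minute-level measurements into fixed-length,
# non-overlapping one-hour samples.
n_features, n_minutes = 2, 1440
X_day = np.random.randn(n_features, n_minutes)  # one day of measurements
window = 60                                     # one hour

X_hourly = np.stack([X_day[:, t:t + window]
                     for t in range(0, n_minutes, window)])
print(X_hourly.shape)  # (24, 2, 60): one fixed-length sample per hour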

One important issue with your formulation is that you use the target state to define the samples (i.e., how to split the whole time series into sub-series). When you apply the algorithm to new, unseen data, you won't be able to split the whole time series, because you can't use the labels of the test samples (that would be "cheating").

To me, your task looks more like regression for a binary time series. You have a binary target time series y = (y_1, ..., y_n) and features X = (x_1, ..., x_n), where x_t is the vector of features at time t. You want to predict the future state of the target, y_{t+tau}, where tau is how far in advance you want to predict, given the past information (y_1, ..., y_t) and (x_1, ..., x_t).
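
In code, a minimal sketch of that formulation (tau, the lag window, and the data are hypothetical):

import numpy as np

# Sketch: build a supervised data set for predicting y_{t+tau} from the
# past `lag` values of the target and the features.
tau, lag = 5, 60
n, n_features = 1000, 3
y_series = np.random.randint(2, size=n)    # binary target series
X_series = np.random.randn(n, n_features)  # feature series

samples, targets = [], []
for t in range(lag, n - tau):
    past = np.concatenate([y_series[t - lag:t],       # (y_{t-lag}, ..., y_{t-1})
                           X_series[t - lag:t].ravel()])
    samples.append(past)
    targets.append(y_series[t + tau])                 # future state
X_sup, y_sup = np.array(samples), np.array(targets)
print(X_sup.shape, y_sup.shape)  # (935, 240) (935,)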

I discussed this in a previous issue, #80. You may find some relevant information in that discussion.

Let me know if this helps you.

I think my simple example didn't illustrate the problem, my apologies. Please let me try again.

I'm building a classifier on a set of time series to classify whether a future, examined time is in the target state (True) or not (False). I thought about training this classifier (for example, scikit-learn's RandomForestClassifier) on the X generated by the WEASEL+MUSE transform. Once trained, the classifier could then be applied to future data to give the probability that it's in state (e.g. predict_proba from RandomForestClassifier).
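
For what it's worth, that combination is straightforward to set up once the samples have a fixed length; here is a sketch on synthetic data (all shapes and parameters are illustrative):

import numpy as np
from pyts.multivariate.transformation import WEASELMUSE
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Sketch: WEASEL+MUSE words feeding a random forest.
rng = np.random.RandomState(0)
X = rng.randn(60, 4, 120)        # (n_samples, n_features, n_timestamps)
y = rng.randint(2, size=60)

clf = make_pipeline(
    WEASELMUSE(word_size=2, n_bins=2, window_sizes=[12, 36], sparse=False),
    RandomForestClassifier(n_estimators=100, random_state=0))
clf.fit(X, y)
proba = clf.predict_proba(X[:5])  # probability of being in state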

The time series were labeled as in-state or out-of-state for all time, so they have uneven boundaries (for example, in-state from 01:34 to 03:15, out of state from 03:15 to 09:32, etc.). That was the original reason for using samples with varying n_timestamps.

I'm not trying to predict the actual values of the time series, just to classify them. I thought the words generated by WEASEL+MUSE could be robust features for training this classifier. Do you have any experience using it for a problem like that? Do you even think WEASEL+MUSE is appropriate for it? Thanks.

As you said, your first example was maybe too simple, but from what I understood, you have a target state Y that you want to predict given several features X_1, ..., X_N. If you have measurements every minute, your data looks like this:

Time   Y  X_1  ...  X_N
00:00  0  3    ...  6
00:01  0  4    ...  5
00:02  0  2    ...  8
00:03  1  4    ...  6
00:04  1  5    ...  4
00:05  0  3    ...  5
00:06  0  4    ...  5
00:07  0  3    ...  6
00:08  0  3    ...  5
00:09  0  5    ...  7
...    ..  ...  ...  ...

So you have 1 + N univariate time series:

  • One that you want to predict, Y, which is binary (True or False only),
  • N time series that can be used to predict Y.

You split the whole sequence of measurements based on the value of the target state and define the label as the future value of the target (see the sketch after this list):

  • Sample 1: Y=1, X_1 = [3, 4, 2], ..., X_N = [6, 5, 8]
  • Sample 2: Y=0, X_1 = [4, 5], ..., X_N = [6, 4]
  • Sample 3: Y=1, X_1 = [3, 4, 3, 3, 5, ...], ..., X_N = [5, 5, 6, 5, 7, ...]
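
Here is a sketch of that splitting (illustration only, since it uses the labels to cut the sequence, which is exactly the "cheating" issue mentioned above):

import numpy as np

# Sketch: cut the sequence at every change of the target state Y and
# label each segment with the value of the *next* segment.
Y   = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0])
X_1 = np.array([3, 4, 2, 4, 5, 3, 4, 3, 3, 5])

cuts = np.flatnonzero(np.diff(Y)) + 1       # indices where Y changes
segments = np.split(X_1, cuts)              # [[3,4,2], [4,5], [3,4,3,3,5]]
states = [s[0] for s in np.split(Y, cuts)]  # [0, 1, 0]

samples = segments[:-1]  # the last segment has no future value
labels = states[1:]      # label = the next segment's state
# samples: [array([3, 4, 2]), array([4, 5])], labels: [1, 0]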

Am I wrong? To me, your target state is a binary time series and you are trying to predict its future value.

No, your description of the problem is correct, and re-reading your last post, I understand it better now. Let me look more into the other post and that topic. Thank you so much for your time!