Feature request: Handle overlapping samples in shapelet methods


miguellacerda opened this issue 4 years ago

A common approach to constructing a sample for time series classification is to take one long time series and break it into smaller, overlapping series which each have their own label. However, if you have overlapping sequences, the ShapeletTransform and LearningShapelets algorithms in pyts return the same shapelet several times, since it occurs in multiple samples. Setting remove_similar = True in ShapeletTransform does not resolve this, since it only excludes similar shapelets "taken from the same time series". It would be great if these algorithms considered only a single instance of a shapelet that is shared across multiple time series samples. At the moment, I manually remove identical shapelets from the output (so that the number of shapelets I actually end up with is < n_shapelets).

Johann Faouzi · Answer 1 · Sat Aug 01 2020 21:48:08 GMT+0800 (China Standard Time)

The shapelet transform already performs the overlapping subsequences approach, and the learning shapelet classifier does not really need it because it learns the features (shapelets) itself.

The original paper that introduced the shapelet transform uses the following definition of self-similarity:

We define two shapelets as being self-similar if they are taken from the same series and have any overlapping indices.

I also agree that this is not ideal, because if a shapelet is discriminative for a class, it will be picked up for each time series belonging to this class in the training set.

For the learning shapelet classifier, there is no such constraint because the shapelets are learned. We could expect the optimization procedure (gradient descent) to converge to a point where the shapelets differ from each other, since identical shapelets would be redundant. I think that it should be less of an issue for the learning shapelet classifier than for the shapelet transform.

Are the shapelets really identical, or just very close (according to a metric)? The shapelet transform basically consists of:

1. Evaluating all the possible shapelets
2. (Optional) Removing all the self-similar shapelets
3. Returning the n_shapelets most discriminative shapelets

So if you have a lot of RAM, you can set n_shapelets to a very large value so that only step 1 is performed, and you can select the shapelets yourself.

I don't really like the definition of self-similarity of the shapelet transform algorithm, but it's what they used in their paper and I try to stick to what is published in the literature. I think a metric-based approach (Euclidean distance for same-length shapelets, it may be more complicated for different-length shapelets since DTW has a quadratic computational complexity) could be better.
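A metric-based filter of the kind described above could look roughly like the following. This is only a sketch, not something pyts provides: the function name drop_near_duplicates and the threshold parameter are hypothetical, it assumes the shapelets are already ranked from most to least discriminative, and it only compares shapelets of equal length using plain Euclidean distance.

```python
import math


def drop_near_duplicates(shapelets, threshold):
    """Greedily keep a shapelet only if it is farther than `threshold`
    (Euclidean distance) from every same-length shapelet kept so far.

    Hypothetical helper: assumes `shapelets` is ordered best-first.
    """
    kept = []
    for shapelet in shapelets:
        candidate = [float(v) for v in shapelet]
        too_close = any(
            len(other) == len(candidate)
            and math.dist(other, candidate) <= threshold
            for other in kept
        )
        if not too_close:
            kept.append(candidate)
    return kept


# Made-up ranked shapelets, for illustration only.
ranked = [[3, 4, 5], [3.1, 4.0, 5.0], [0, 1, 2], [3, 4, 5]]
kept = drop_near_duplicates(ranked, threshold=0.5)
print(kept)  # [[3.0, 4.0, 5.0], [0.0, 1.0, 2.0]]
```

With threshold=0 this reduces to removing exact duplicates. Shapelets of different lengths are never compared here, which sidesteps the DTW cost but also means they are never considered similar.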

Hope this helps you a bit.

miguellacerda · Answer 2 · Mon Aug 03 2020 14:56:41 GMT+0800 (China Standard Time)

Thanks for your response, Johann. I'm not sure that you have understood my issue (although I might have misunderstood you). Let me give an example:

Consider a time series x = [0,1,2,3,4,5,6,7,8,9] with binary labels at each time point given by y = [0,0,1,1,1,0,0,0,1,1]. Let's suppose that we want to predict the label y given the previous 5 values of x. We could construct a training sample as follows:

(1) [0,1,2,3,4] --> 1
(2) [1,2,3,4,5] --> 0
(3) [2,3,4,5,6] --> 0
(4) [3,4,5,6,7] --> 0
(5) [4,5,6,7,8] --> 1
(6) [5,6,7,8,9] --> 1
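For reference, the windows above can be generated with a few lines of plain Python. This is a minimal sketch: make_windows is a hypothetical helper, and it assumes each window is labelled with the value of y at the window's last time point, which is what the listing above implies.

```python
def make_windows(x, y, width):
    """Split x into overlapping windows of `width` values (step 1) and
    label each window with y at the window's last time point."""
    windows, labels = [], []
    for end in range(width, len(x) + 1):
        windows.append(x[end - width:end])
        labels.append(y[end - 1])
    return windows, labels


x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1]

windows, labels = make_windows(x, y, width=5)
for window, label in zip(windows, labels):
    print(window, "-->", label)
```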

The subsequence [3,4,5] is a strong predictor of the label in this case. The ShapeletTransform algorithm will return this shapelet three times, since it doesn't consider the [3,4,5] in sequence (2) to be similar to the [3,4,5] in sequences (3) or (4). As you point out, self-similarity is only defined within each example.

You can see this for yourself:

from pyts.transformation import ShapeletTransform

# Six overlapping windows of length 5 and their labels
X = [[0, 1, 2, 3, 4],
     [1, 2, 3, 4, 5],
     [2, 3, 4, 5, 6],
     [3, 4, 5, 6, 7],
     [4, 5, 6, 7, 8],
     [5, 6, 7, 8, 9]]

y = [1, 0, 0, 0, 1, 1]

st = ShapeletTransform(n_shapelets=5,
                       criterion='anova',
                       window_sizes=[3],
                       window_steps=[1],
                       remove_similar=True,
                       sort=True)

st.fit(X, y)
st.shapelets_
# returns [3,4,5] three times!
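The source of the repetition can also be checked without fitting anything, by counting how often each length-3 subsequence occurs across the six windows. This is only an illustrative count of candidate subsequences in plain Python. It does not reproduce how ShapeletTransform scores or selects them.

```python
from collections import Counter

X = [[0, 1, 2, 3, 4],
     [1, 2, 3, 4, 5],
     [2, 3, 4, 5, 6],
     [3, 4, 5, 6, 7],
     [4, 5, 6, 7, 8],
     [5, 6, 7, 8, 9]]
width = 3

# Enumerate every length-3 subsequence (step 1) of every window.
counts = Counter(
    tuple(series[start:start + width])
    for series in X
    for start in range(len(series) - width + 1)
)

print(counts[(3, 4, 5)])                  # 3
print(len(counts), sum(counts.values()))  # 8 18
```

Only 8 of the 18 candidates are distinct, so duplicates across overlapping windows are to be expected.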

Johann Faouzi · Answer 3 · Mon Aug 03 2020 16:20:10 GMT+0800 (China Standard Time)

In the literature, time series classification usually refers to a problem in which each time series has a label, that is, one wants to predict y given x. You have transformed your problem into a time series classification task, but your original problem is not a time series classification task.

Your objective is to predict y_t given (x_1,...,x_t) and (y_1,...,y_{t-1}), where x_i is a vector of features at time i.
In the literature, your problem is usually referred to as a regression model for binary time series. Here are a few papers that I found when doing some literature research a while ago:

Now I better understand your issue, but I don't know if addressing it is in the scope of this package since your original problem is not a time series classification task. You say that:

At the moment, I manually remove identical shapelets from the output (so that the number of shapelets I actually end up with is < n_shapelets)

but it could be done automatically. If you just want to remove duplicates, you can use the drop_duplicates() method of a pandas.DataFrame:

import pandas as pd
from pyts.transformation import ShapeletTransform

X = [[0, 1, 2, 3, 4],
     [1, 2, 3, 4, 5],
     [2, 3, 4, 5, 6],
     [3, 4, 5, 6, 7],
     [4, 5, 6, 7, 8],
     [5, 6, 7, 8, 9]]

y = [1, 0, 0, 0, 1, 1]

st = ShapeletTransform(n_shapelets=5,
                       criterion='anova',
                       window_sizes=[3],
                       window_steps=[1],
                       remove_similar=True,
                       sort=True)

st.fit(X, y)
# Drop repeated rows, keeping the first occurrence of each shapelet
pd.DataFrame(st.shapelets_).drop_duplicates().to_numpy()
# returns [3,4,5] one time :)

But since you will remove duplicates, you should set n_shapelets to a larger value than the number of shapelets that you want, then remove the duplicates, and finally select the top-n shapelets (where n is the number of shapelets that you want).
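The over-select, deduplicate, then truncate procedure could be sketched as follows. The helper top_unique_shapelets is hypothetical, not part of pyts. It assumes the input is already ordered from most to least discriminative (for example shapelets_ from a transform fitted with sort=True and a generous n_shapelets), and it treats shapelets as duplicates only when they are exactly equal.

```python
def top_unique_shapelets(shapelets, n):
    """Keep the first occurrence of each distinct shapelet, then return
    the first n of them. Assumes `shapelets` is ordered best-first."""
    seen = set()
    unique = []
    for shapelet in shapelets:
        # Tuples are hashable; this works for lists and NumPy arrays alike.
        key = tuple(float(v) for v in shapelet)
        if key not in seen:
            seen.add(key)
            unique.append(list(key))
    return unique[:n]


# Made-up stand-in for an over-sized, sorted st.shapelets_.
candidates = [[3, 4, 5], [3, 4, 5], [4, 5, 6], [3, 4, 5], [2, 3, 4], [4, 5, 6]]
top2 = top_unique_shapelets(candidates, n=2)
print(top2)  # [[3.0, 4.0, 5.0], [4.0, 5.0, 6.0]]
```

Because it compares shapelets as tuples rather than as rows of a table, this version also copes with shapelets of different lengths.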

miguellacerda · Answer 4 · Mon Aug 03 2020 16:45:02 GMT+0800 (China Standard Time)

Thanks, Johann. I'll certainly check out the references you provided. However, in my case, it is reasonable to assume that y_{t} is conditionally independent of (y_{t-1}, y_{t-2}, ...) given (x_{t}, x_{t-1}, ...). I therefore think that this does qualify as a time series classification problem. Either way, I'll simply drop duplicates as you suggest. Thanks so much for your suggestions!