johannfaouzi / pyts

A Python package for time series classification

Home Page: https://pyts.readthedocs.io

SAX-VSM, constant time series error

Siniara opened this issue · comments

Description

When running SAX-VSM on my timeseries I get the following error:
At least one sample is constant.

I tried filtering out all the constant time series with
X = X[np.where(~(np.var(X, axis=1) == 0))[0]]
to no avail

I tried fitting the model on 1 non-constant array and still got the error. I think the issue is that this error is thrown when the SAX approximation would give the same symbol for the whole window, thus meaning that the window is constant. E.g., for a word size of 3, if the SAX transform would yield 'aaa', then this error appears. Could that be the case?

Steps/Code to Reproduce

import numpy as np
from pyts.classification import SAXVSM

X_train = np.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2]])
y_train = np.array([1])

clf = SAXVSM(window_size=0.5, word_size=0.5, n_bins=3, strategy='normal')
clf.fit(X_train, y_train)

Versions

NumPy 1.20.1
SciPy 1.6.1
Scikit-Learn 0.24.1
Numba 0.53.1
Pyts 0.11.0

Additionally, I get this error:
'n_bins' must be greater than or equal to 2 and lower than or equal to min(word_size, 26)

If n_bins represents the alphabet size/the PAA approximation, then why should it be lower than the word size?
Doesn't that mean that situations like alphabet = {a,b,c,d,e} and word size = 3 are impossible? (which shouldn't be the case)

Your intuition is somewhat right. The SAX-VSM algorithm works like this (a small code sketch follows the list):

  1. A sliding window is used to extract overlapping subsequences of a time series.
  2. Each subsequence is standardized, reduced (with PAA) and discretized (with SAX) to transform it into a word.
  3. The bag of words is reduced with numerosity reduction (optional).
  4. The bags of words for time series belonging to the same class are merged together.
  5. The TF-IDF matrix is computed.
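
A minimal sketch of steps 1-3, using pyts.bag_of_words.BagOfWords on random data (the parameters are illustrative, not the ones from this issue):

import numpy as np
from pyts.bag_of_words import BagOfWords

rng = np.random.RandomState(42)
X = rng.randn(2, 100)  # 2 univariate time series of length 100

# Sliding window + standardization + PAA + SAX + numerosity reduction
bow = BagOfWords(window_size=20, word_size=4, n_bins=4, strategy='normal')
print(bow.fit_transform(X))  # e.g. an array of two strings of space-separated words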

The issue comes from step 2: when a subsequence is constant, not only can it not be standardized (zero variance), but the choice of the symbol would also be subjective. For instance, for a subsequence of length 4 with n_bins=3, the discretized sequence could be "aaaa", "bbbb" or "cccc".

So, basically, as long as there is a constant subsequence at least as long as window_size, the error will be raised. Since this issue has been raised several times (because it's common to have constant subsequences in a time series), I think that there are two possible solutions: discarding the constant subsequences, or keeping them and counting them (using the first symbol to create the corresponding word, for instance).
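
A tiny illustration of why step 2 breaks down for a constant subsequence (just numpy, not the pyts code):

import numpy as np

subsequence = np.ones(8)   # constant subsequence
print(subsequence.std())   # 0.0
# (subsequence - subsequence.mean()) / subsequence.std()  -> division by zero,
# and any of the n_bins symbols would be an equally valid choice for the word.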

Thank you for the reply! It's a very interesting point you make about the constant subsequences that I hadn't noticed mentioned in the related works for the method.

Do you perhaps know how the issue is dealt with in the saxpy package made by one of the developers of the SAX-VSM method? When I just run a SAX transformation from that package I do get words like 'aaa' and 'ccc', but then you're saying that they are the same and/or arbitrarily chosen? So these constant patterns could never be important features that discriminate between classes?

Additionally, I'm confused about how exactly step 2 works. If we standardize each subsequence locally then how can we say that 'abc' in one means the same as 'abc' in another one? I thought that one word corresponds to one pattern and that's why they are treated as the same feature? But in reality the raw numbers behind the 'abc' are different depending on the subsequence?

I'm working with the SAX method to see if I can extract interpretable patterns from my timeseries, I thought that if I discover that 'abc' is an important feature for classification that then I would be able to go back to the original data and see what this 'abc' means in numbers, is this then not the case?

The original paper states that, if the standard deviation of a subsequence is too low, the standardization step is skipped (section IV.1):

[Screenshot: excerpt from section IV.1 of the SAX-VSM paper on skipping standardization for low-variance subsequences]

So in this case, I guess that they are using the raw values of the subsequences, and you can get different words depending on the constant value of the subsequences (e.g., [-1, -1, ..., -1] would be transformed into "aaa" while [1, 1, ..., 1] would be transformed into "ccc"). They also assume that the input time series is standardized (zero mean, unit variance).
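
A rough sketch of that rule from the paper (the threshold value here is just an example):

import numpy as np

def znorm(subsequence, threshold=0.01):
    # Skip the standardization if the standard deviation is too low,
    # as described in section IV.1 of the SAX-VSM paper
    std = subsequence.std()
    if std < threshold:
        return subsequence          # keep the raw values
    return (subsequence - subsequence.mean()) / std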

Do you perhaps know how the issue is dealt with in the saxpy package made by one of the developers of the SAX-VSM method?

Could you give me the code you're using with saxpy? It would help me dig into their code to see what is implemented.

When I just run a SAX transformation from that package I do get words like 'aaa' and 'ccc', but then you're saying that they are the same and/or arbitrarily chosen? So these constant patterns could never be important features that discriminate between classes?

Additionally, I'm confused about how exactly step 2 works. If we standardize each subsequence locally then how can we say that 'abc' in one means the same as 'abc' in another one? I thought that one word corresponds to one pattern and that's why they are treated as the same feature? But in reality the raw numbers behind the 'abc' are different depending on the subsequence?

I'm working with the SAX method to see if I can extract interpretable patterns from my timeseries, I thought that if I discover that 'abc' is an important feature for classification that then I would be able to go back to the original data and see what this 'abc' means in numbers, is this then not the case?

The assumption behind the standardization of each subsequence is that the local mean and variance do not matter. You can get a pattern by transforming a word into a vector (although you lose some information, because all the values falling in the same bin are transformed into the same letter), and this pattern is independent of its mean and its variance. So, if p is your pattern (a vector of real numbers), then a + b * p for any real numbers a and b also corresponds to this pattern.
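
A quick numpy check of this invariance (here with b > 0; a negative b would flip the pattern):

import numpy as np

def znorm(x):
    return (x - x.mean()) / x.std()

p = np.array([0.0, 1.0, 3.0, 2.0])       # a pattern
q = 5.0 + 2.0 * p                        # a + b * p with a=5, b=2
print(np.allclose(znorm(p), znorm(q)))   # True: same standardized pattern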

If you really want a single pattern, you should have a look at shapelet-based algorithms. A shapelet is simply a real-valued vector and the distance between a time series and a shapelet is defined as the minimum squared Euclidean distance between the shapelet and all the subsequences extracted from the time series with the same length:

[Screenshot: the shapelet distance formula]
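
A rough numpy version of that distance (one common formulation, without normalizing by the shapelet length):

import numpy as np

def shapelet_distance(time_series, shapelet):
    # Minimum squared Euclidean distance between the shapelet and all
    # subsequences of the same length extracted from the time series
    windows = np.lib.stride_tricks.sliding_window_view(time_series, len(shapelet))
    return np.min(np.sum((windows - shapelet) ** 2, axis=1))

ts = np.array([0., 0., 1., 3., 1., 0., 0.])
print(shapelet_distance(ts, np.array([1., 3., 1.])))   # 0.0: exact match found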

I really appreciate your informative answer! I'm a Master's student doing a project with the SAX transformation, so I'm just learning about this method.

Indeed, I'd missed the bit about not applying the normalization, and it makes sense in light of the points you've already mentioned. But so in the pyts implementation of SAX-VSM this step is not happening? As in, if the normalization can't be applied, the error is thrown instead of skipping the normalization?

Now I had a light-bulb moment and I understood that saxpy (https://github.com/seninp/saxpy) allows the user to specify this threshold z_threshold and so that's how it deals with the constant subsequences. I'm not sure what you meant with sharing the code. I'm using the sax_via_window function from saxpy. I wanted to run some experiments with the SAX-VSM method yet saxpy seems to be discontinued so I was happy to find this project.

The assumption behind the standardization of each subsequence is that the local mean and variance do not matter. You can get a pattern by transforming a word into a vector (although you lose some information, because all the values falling in the same bin are transformed into the same letter), and this pattern is independent of its mean and its variance. So, if p is your pattern (a vector of real numbers), then a + b * p for any real numbers a and b also corresponds to this pattern.

Sorry, I'm not entirely sure I understand. What do you mean by transforming a SAX word into a vector? Since the letters are essentially bins for the normal distribution, they correspond to a range of values, not a single value?

Indeed, I'd missed the bit about not applying the normalization, and it makes sense in light of the points you've already mentioned. But so in the pyts implementation of SAX-VSM this step is not happening? As in, if the normalization can't be applied, the error is thrown instead of skipping the normalization?

That's an oversight on my part, I also did not pay attention to this when I wrote the implementation, and I want to fix it! The normalization step could definitely be skipped for the subsequences that have a standard deviation below a threshold.

I looked at the code in saxpy for the z-normalization and it seems like they still center the subsequence even if they don't divide it by its standard deviation. So I don't get (for the moment) how you could get words like "aaa" or "ccc" if the subsequences are centered: the mean of a centered subsequence is zero, and its standard deviation is low, so all the values must be close to zero, and you should get a word with only the "middle" symbol ("bbb" if you use 3 symbols). And it does not really match what is written in the original paper.

By sharing the code, I meant which function in saxpy you used (sax_via_window) with which parameters (and possibly the input time series if you can share it, or a toy time series that illustrates the same behavior). I'm going to play a bit with the function to see if I can reproduce the same behavior and understand why it occurs.

Sorry, I'm not entirely sure I understand. What do you mean by transforming a SAX word into a vector? Since the letters are essentially bins for the normal distribution, they correspond to a range of values, not a single value?

Yes, they correspond to a range of values, and an approximation would simply consist in taking the center of the interval (or any real number in the interval). In the end, if you really want to find a real-valued pattern, you have to make some kind of approximation to go back to real numbers from symbols/bins.
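
For instance, with strategy='normal' you could take one representative value per bin; here I use the median of each bin, since the two outer bins are unbounded and have no center (just a sketch, not something pyts provides):

import numpy as np
from scipy.stats import norm

n_bins = 3
# Bin edges under strategy='normal': quantiles of the standard normal distribution
bin_edges = norm.ppf(np.linspace(0, 1, n_bins + 1))        # [-inf, -0.43, 0.43, inf]
# One representative value per bin: the median of each bin
bin_medians = norm.ppf((np.arange(n_bins) + 0.5) / n_bins)  # [-0.97, 0.0, 0.97]

word = "abc"
pattern = np.array([bin_medians[ord(symbol) - ord("a")] for symbol in word])
print(pattern)   # approximate real-valued pattern behind the word "abc"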

So I did some research and the latest release of saxpy on PyPI is from March 11, 2018, which corresponds to the following source code.

In this version, if the standard deviation of the subsequence is below a given threshold, the subsequence is neither centered nor scaled. That's why you can get words like "aaa", "bbb" and "ccc" for (almost) constant subsequences, the symbol depending on the values of the subsequence. With the current version (master branch), the subsequences are always centered, and I only get "bbb" words.

Since the original paper introducing SAX-VSM was published in 2013 and the description in the paper matches the latest release on PyPI, I think that I will implement SAX-VSM this way (and possibly add a parameter to allow for centering subsequences with a standard deviation below the threshold).

Good catch!

The time series I'm using is rather long to share; when I try to make it smaller/downsample, I don't get the same patterns anymore (unsurprisingly), so it's hard to share reproducible code. Yet I observed that I can get multiple constant SAX words if I have NaNs in my time series. I'm not quite sure how saxpy deals with them. If I drop them, I do seem to get no more than one constant SAX word (which could be aaa or ccc), but I haven't checked all of them.

I'm using SAX for time series classification. I'm working with multivariate time series and for the moment I use sax_via_window from saxpy on each of my features separately and then combine the extracted words with an identifier that shows which series it came from into a single vector (document/learning instance). Then I apply a tf-idf scoring on the resulting matrix. I'm trying to imitate the SAX-VSM method for the multivariate case. I'm not quite successful in creating the final matrix I think, which only has vectors representing the classes (not the individual instances). As I understood the goal for SAX-VSM is to combine all patterns found from a single class, so I just tried summing the patterns found across a class and then apply tf-idf scoring on the resulting matrix, which has 2 rows in my instance as I have 2 classes. In my attempts I get a much lower classification accuracy with that and cosine similarity than if I just apply a ML classifier on tf-idf matrix with all the individual instances.

Inspired by you I also took a closer look at saxpy source code and noticed that it seems to support multidimensional time series in the sax_via_window param sax_type even though it's not documented (so not sure if/how it's working), I may give it a go. Is there something similar in pyts ?

The time series I'm using is rather long to share; when I try to make it smaller/downsample, I don't get the same patterns anymore (unsurprisingly), so it's hard to share reproducible code. Yet I observed that I can get multiple constant SAX words if I have NaNs in my time series. I'm not quite sure how saxpy deals with them. If I drop them, I do seem to get no more than one constant SAX word (which could be aaa or ccc), but I haven't checked all of them.

SAX-VSM applies numerosity reduction, a technique that only keeps a single instance of identical consecutive words.

[Screenshot: excerpt from the SAX-VSM paper describing numerosity reduction]
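
In other words, identical consecutive words are collapsed into a single occurrence; a minimal sketch:

words = ["aab", "aab", "aab", "abc", "abc", "aab", "ccb"]
# Numerosity reduction: keep only one instance of identical consecutive words
reduced = [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
print(reduced)   # ['aab', 'abc', 'aab', 'ccb']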

I'm using SAX for time series classification. I'm working with multivariate time series and for the moment I use sax_via_window from saxpy on each of my features separately and then combine the extracted words with an identifier that shows which series it came from into a single vector (document/learning instance). Then I apply a tf-idf scoring on the resulting matrix. I'm trying to imitate the SAX-VSM method for the multivariate case. I'm not quite successful in creating the final matrix I think, which only has vectors representing the classes (not the individual instances). As I understood the goal for SAX-VSM is to combine all patterns found from a single class, so I just tried summing the patterns found across a class and then apply tf-idf scoring on the resulting matrix, which has 2 rows in my instance as I have 2 classes. In my attempts I get a much lower classification accuracy with that and cosine similarity than if I just apply a ML classifier on tf-idf matrix with all the individual instances.

SAX-VSM is not well known for its high predictive performance. Its upside is mostly scalability in terms of number of training time series and number of classes. The training phase is relatively simple (computing the TF-IDF matrix), and the inference phase is also simple (highest cosine similarity). Moreover, to make IDF useful, you need a lot of classes: for 2 classes, the document frequency can only take 2 values (1 or 2). So I'm not very surprised that it does not perform well.

I don't know if you have some requirements about the classifier that you want to use (interpretability, running complexity, etc.) but there are probably other relevant classifiers for your use case that may be worth trying out.

Inspired by you I also took a closer look at saxpy source code and noticed that it seems to support multidimensional time series in the sax_via_window param sax_type even though it's not documented (so not sure if/how it's working), I may give it a go. Is there something similar in pyts ?

Looking at the examples in the docstring of the sax_via_window function, there seem to be 3 possible strategies: SAX-ENERGY, SAX-REPEAT and SAX-INDEPENDENT. I would guess that SAX-INDEPENDENT applies SAX independently to each feature. There is a reference for SAX-REPEAT in the README. No idea what SAX-ENERGY is (maybe SAX-ZSCORE that is mentioned in the README).

Multivariate time series are unfortunately poorly supported in pyts, mainly because the algorithms, as described in their original papers, almost always work on univariate time series only. Nonetheless, if you want to extend a transformer or a classifier in pyts in a dummy way, there are two convenient classes: pyts.multivariate.transformation.MultivariateTransformer and pyts.multivariate.classification.MultivariateClassifier.

So, if you would like to use SAX-VSM for multivariate time series in pyts, you would have two options:

  • Use pyts.multivariate.transformation.MultivariateTransformer with pyts.bag_of_words.BagOfWords, which is the underlying class that performs (subsequence extraction + PAA + SAX + numerosity reduction + merge) to transform a time series into a bag of words, then compute the TF-IDF matrix and the cosine similarities yourself.
  • Use pyts.multivariate.classification.MultivariateClassifier with SAXVSM to fit an independent SAX-VSM classifier on each feature and combine the predictions with majority voting.

I will try to work on fixing the standardization issue in my implementation of SAX-VSM ASAP.

Your help is very much appreciated! For my problem the interpretability is key, hence SAX was a method I considered as then I could look at the patterns that are the most informative for classification and locate them/visualize them in the original timeseries. Also from domain knowledge I know that features from each individual time series won't be enough for class attribution, but rather a combination of them. The underlying problem is understanding why these series belong to one class or the other. Also as you mention many methods are developed for the univariate case, this huge bag of words method seemed to be a relatively straightforward way to circumvent that.

There's usually a trade-off between predictive performance and interpretability: more complex models might perform better at the cost of being harder to interpret. Also, I would rather interpret (with much caution) a complex model that performs well than a simple model that performs mediocrely.

Obviously this is your project, but if I were you, I would first focus on the predictive performance in order to have a rough estimation of how well you can predict the target of interest (class) given the input data (multivariate time series), without paying attention to the interpretability of the function mapping the input to the target. When this is done, I would compare the predictive performance of the different models and decide which model I pick (based on their predictive performance and interpretability). Finally, I would focus on the interpretability of this model.

Concerning the multivariate point, a simple approach, if you use an algorithm that extracts information followed by a standard machine learning classifier, is to concatenate the information independently extracted for each feature and let the classifier deal with all the information at once.

Among the algorithms implemented that extract information, each variable (feature in machine learning terms) is relatively easy to interpret. Based on this, a general approach to evaluate the impact of a feature is permutation feature importance, which simply consists in shuffling a feature and seeing the difference in predictive performance (you need to take care of correlated features; it is mentioned at the bottom of the page).
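
A minimal, self-contained sketch of permutation feature importance with scikit-learn (toy tabular data standing in for a word-count/TF-IDF matrix, not your dataset):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                                     # 100 samples, 5 features
y = (X[:, 0] + 0.1 * rng.randn(100) > 0.5).astype(int)   # target driven by feature 0

clf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)   # feature 0 should dominate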

The issue with constant subsequences in SAX-VSM should be fixed soon (I hope that I can merge #108 today).

Edit: the PR is merged and the fix is available on the master branch.

Thanks for the fix!

So now indeed the SAX-VSM doesn't throw that error anymore. However, when I tried to use the approach you suggested:

Use pyts.multivariate.transformation.MultivariateTransformer with pyts.bag_of_words.BagOfWords, which is the underlying class that performs (subsequence extraction + PAA + SAX + numerosity reduction + merge) to transform a time series into a bag of words.

I still got the same error about the constant samples.

This is my code, maybe I missed something? :

from pyts.multivariate.transformation import MultivariateTransformer
from pyts.bag_of_words import BagOfWords
transformer = MultivariateTransformer(BagOfWords(window_size=20, word_size=3, n_bins=3), flatten=True)

X_new = transformer.fit_transform(X)

151     def _check_constant(self, X):
152         if np.any(np.max(X, axis=1) - np.min(X, axis=1) == 0):
153             raise ValueError("At least one sample is constant.")
154 
155     def _compute_bins(self, X, n_samples, n_bins, strategy):
> ValueError: At least one sample is constant.

Alternatively, maybe it is possible to provide the Bag of Words representation to the SAX-VSM directly so that it only performs the classification of already transformed signals?

EDIT. I also tried with the threshold parameter, no change.

transformer = MultivariateTransformer(BagOfWords(window_size=20, word_size=3, n_bins=3, threshold_std=0.05), flatten=True)

Did you reinstall the package with the fix? With something like:

git clone https://github.com/johannfaouzi/pyts
cd pyts
pip install .

If you can use the threshold_std parameter, it means that you should have done it, but just to be sure. With the fix, the _check_constant method should not be called for subsequences that have a standard deviation below threshold_std:

# Identify subsequences whose standard deviation is below the threshold
idx = np.std(X_window, axis=1) < self.threshold_std
if np.any(idx):
    # Subsequences with standard deviations below threshold
    X_paa = PiecewiseAggregateApproximation(
        window_size=None, output_size=word_size,
        overlapping=self.overlapping
    ).transform(X_window[idx])
    # Compute the bin edges
    discretizer = KBinsDiscretizer(
        n_bins=self.n_bins, strategy=self.strategy,
        raise_warning=self.raise_warning
    )
    bin_edges = discretizer._compute_bins(X_scaled, n_samples,
                                          self.n_bins, self.strategy)
    # Tile the bin edges for each subsequence from the same time series
    if self.strategy != 'normal':
        count = np.bincount(
            np.floor_divide(np.nonzero(idx)[0], n_windows)
        )
        bin_edges = np.vstack([
            np.tile(bin_edges[i], (count[i], 1))
            for i in range(count.size) if count[i] != 0
        ])
    X_sax_below_thresh = alphabet[_digitize(X_paa, bin_edges)]

The following code works for me, could you try it?

from pyts.datasets import load_basic_motions
from pyts.multivariate.transformation import MultivariateTransformer
from pyts.bag_of_words import BagOfWords


X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)

# Make a time series constant
X_train[0] = 0.

# Transform the dataset of multivariate time series
transformer = MultivariateTransformer(
    BagOfWords(window_size=20, word_size=3, n_bins=3), flatten=True
)
X_train_bow = transformer.fit_transform(X_train)
X_test_bow = transformer.transform(X_test)

Yes, I installed it as you outlined; my current version is pyts 0.12dev0. The code snippet you provide works, but I just retried on my data and I still get the message about the constant time series.

What is the shape of your input data X? It's expected to be a 3D-array with:

  • the first dimension being for time series
  • the second dimension being for features (the different features of multivariate time series)
  • the third dimension being time.

In my code snippet, X_train.shape is (40, 6, 100), meaning that:

  • there are 40 multivariate time series,
  • each multivariate time series has 6 features, and
  • each time series has 100 time points.

My X.shape was (2396, 7, 51) -> number of multivariate samples/time series, 7 features and 51 measurements per feature. Which seems to correspond. I also tried a longer time series that is (2396, 7, 101), thinking that maybe it was too short, but I still get the same error.

EDIT. I provide you the first 2 samples, so the data would be (2, 7, 51), maybe you manage to reproduce the error?

First two samples data = np.array([[[ 4.83199997e+01, 4.81500015e+01, 4.80499992e+01, 4.81699944e+01, 4.79900017e+01, 4.77200050e+01, 4.77500000e+01, 4.76300011e+01, 4.74599991e+01, 4.74750023e+01, 4.74500008e+01, 4.73199997e+01, 4.73600006e+01, 4.72200012e+01, 4.70099983e+01, 4.68899994e+01, 4.69033356e+01, 4.68800011e+01, 4.69000015e+01, 4.69750023e+01, 4.68300056e+01, 4.67900009e+01, 4.67599983e+01, 4.66800003e+01, 4.67700005e+01, 4.70200005e+01, 4.65900002e+01, 4.66599998e+01, 4.66399994e+01, 4.67299995e+01, 4.68000031e+01, 4.67500000e+01, 4.68600006e+01, 4.67266655e+01, 4.65000000e+01, 4.65800018e+01, 4.65499992e+01, 4.67500000e+01, 4.66899986e+01, 4.66699982e+01, 4.68499985e+01, 4.67400017e+01, 4.67599983e+01, 4.68499985e+01, 4.66749992e+01, 4.66699982e+01, 4.67000008e+01, 4.68499985e+01, 4.68199997e+01, 4.69099998e+01, 4.69599991e+01], [-3.50000000e+00, -3.50000000e+00, -3.50000000e+00, -3.50000000e+00, -3.50000000e+00, -3.40000010e+00, -3.40000010e+00, -2.59999990e+00, -2.50000000e+00, -2.29999995e+00, -2.20000005e+00, -1.89999998e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -2.59999990e+00, -3.20000005e+00, -3.20000005e+00, -3.20000005e+00, -3.09999990e+00, -3.00000000e+00, -3.00000000e+00, -2.70000005e+00, -2.70000005e+00, -2.50000000e+00, -2.50000000e+00, -2.29999995e+00, -2.69999599e+00, -2.94998217e+00, -2.20000005e+00, -2.20000005e+00, -2.29999995e+00, -2.09999990e+00, -1.89999998e+00, -1.79999995e+00, -1.89999998e+00, -1.89999998e+00, -1.79999995e+00, -1.89999998e+00, -1.79999995e+00, -1.89999998e+00, -1.89999998e+00, -1.89999998e+00, -2.00000000e+00, -2.20000005e+00], [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00], [-3.76166612e-01, -3.10367614e-01, -4.65889990e-01, -2.14665994e-01, -2.50550002e-01, -3.04383397e-01, -1.24935642e-01, -3.94109130e-01, -1.78770006e-01, -7.11008534e-02, -2.86440015e-01, -1.78771287e-01, -3.52108553e-02, -3.22329998e-01, -1.06991708e-01, -1.42879993e-01, -1.30915329e-01, -7.10987151e-02, -1.42879993e-01, -8.90428573e-02, -7.10978583e-02, 3.65700014e-02, 7.24597871e-02, -2.50544876e-01, -7.10999966e-02, 1.86276734e-02, -1.78769365e-01, 7.24600032e-02, 6.80641737e-04, 7.24600032e-02, 6.78930432e-04, 3.65667902e-02, -3.52097861e-02, -1.30916759e-01, 1.44236580e-01, 1.08350001e-01, 1.08356416e-01, 1.44238934e-01, 1.26294672e-01, -3.52091454e-02, 7.24610686e-02, 7.24600032e-02, 1.08350001e-01, 1.80130005e-01, 7.24600032e-02, 7.24600032e-02, 3.65700014e-02, 7.24600032e-02, 7.24600032e-02, 3.65700014e-02, 1.44240007e-01], [ 3.23685735e-01, 5.86882949e-01, 3.59580010e-01, 3.95464867e-01, 3.65700014e-02, 3.95466805e-01, 2.87805140e-01, 5.39029121e-01, 6.10809982e-01, 3.05747986e-01, 
2.16020003e-01, 1.80128068e-01, 1.80129781e-01, 2.51910001e-01, 1.98077142e-01, 2.87800014e-01, 3.23690563e-01, 3.95470440e-01, 5.03139973e-01, 5.03140867e-01, 3.05747688e-01, 3.59580010e-01, 2.16020644e-01, 3.65725681e-02, 2.51910001e-01, 2.16014653e-01, -1.78769365e-01, 2.16020003e-01, 3.41633409e-01, 2.87800014e-01, 2.87802130e-01, 3.95470649e-01, 1.44239575e-01, 1.68166474e-01, 1.44243419e-01, 3.59580010e-01, 3.59577447e-01, 2.51911074e-01, 2.51911938e-01, 2.69856274e-01, 4.13418740e-01, 3.23689997e-01, 3.59579563e-01, 3.23689997e-01, 2.16023430e-01, 2.16020003e-01, 6.79999997e-04, 1.44240007e-01, 2.16020003e-01, 3.59580010e-01, 4.31360006e-01], [ 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02], [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],
   [[ 4.79199982e+01,  4.77099953e+01,  4.81100006e+01,
      4.80999985e+01,  4.81399994e+01,  4.79900017e+01,
      4.79599991e+01,  4.79700012e+01,  4.80200005e+01,
      4.77400017e+01,  4.79799995e+01,  4.80400009e+01,
      4.80100021e+01,  4.80400009e+01,  4.81100006e+01,
      4.79449997e+01,  4.81800003e+01,  4.80999985e+01,
      4.80099983e+01,  4.81699982e+01,  4.80900002e+01,
      4.79500008e+01,  4.81899986e+01,  4.80699959e+01,
      4.86100006e+01,  4.82500000e+01,  4.82700005e+01,
      4.81300011e+01,  4.81199989e+01,  4.81800003e+01,
      4.82099991e+01,  4.80200005e+01,  4.80999985e+01,
      4.80800018e+01,  4.83100014e+01,  4.83199997e+01,
      4.81999969e+01,  4.82599983e+01,  4.82899971e+01,
      4.83600006e+01,  4.83399963e+01,  4.82599983e+01,
      4.82500000e+01,  4.80600014e+01,  4.80299988e+01,
      4.80499992e+01,  4.79900017e+01,  4.79099998e+01,
      4.74199982e+01,  4.72599983e+01,  4.71899986e+01],
    [-8.00000012e-01, -8.00000012e-01, -1.70000005e+00,
     -2.59999990e+00, -2.59999990e+00, -2.59999990e+00,
     -2.59999990e+00, -2.59999990e+00, -2.59999990e+00,
     -2.40000010e+00, -2.40000010e+00, -2.40000010e+00,
     -2.40000010e+00, -2.40000010e+00, -2.40000010e+00,
     -2.40000010e+00, -2.29999995e+00, -2.29999995e+00,
     -2.00000000e+00, -1.89999998e+00, -1.70000005e+00,
     -1.70000005e+00, -1.60000002e+00, -1.60000002e+00,
     -1.60000002e+00, -1.60000002e+00, -1.60000002e+00,
     -1.39999998e+00, -1.29999995e+00, -1.39999998e+00,
     -1.29999995e+00, -1.29999995e+00, -1.29999995e+00,
     -1.29999995e+00, -1.29999995e+00, -1.10000002e+00,
     -6.99999988e-01, -6.99999988e-01, -6.99999988e-01,
     -6.99999988e-01, -4.00000006e-01, -4.00000006e-01,
     -8.00000012e-01, -1.89999998e+00, -2.00000000e+00,
     -1.89999998e+00, -1.89999998e+00, -1.89999998e+00,
     -1.89999998e+00, -1.89999998e+00, -2.00000000e+00],
    [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
    [ 1.62184462e-01,  1.80128723e-01, -1.42879575e-01,
      1.08350001e-01,  3.65721397e-02,  1.44240007e-01,
      1.26294360e-01,  1.08350001e-01,  1.08350001e-01,
      3.65682878e-02,  5.45155331e-02,  1.44240007e-01,
      3.65712829e-02,  7.24642798e-02,  1.56203046e-01,
      1.26294464e-01,  2.51910001e-01,  7.24612847e-02,
      2.16015726e-01,  7.24600032e-02,  1.80132136e-01,
      2.16016144e-01,  1.44240007e-01,  2.16017142e-01,
      1.80130005e-01,  1.98070183e-01,  1.44233584e-01,
      1.80130854e-01,  2.87800014e-01,  1.98075429e-01,
      2.87800014e-01,  2.16017440e-01,  1.80128723e-01,
     -1.72653217e-02,  7.24608526e-02,  6.79999997e-04,
      1.80130005e-01,  2.16019779e-01,  2.16018289e-01,
      9.04037133e-02,  2.16017142e-01,  7.24600032e-02,
      1.08350001e-01,  1.80130005e-01, -7.10999966e-02,
     -1.42879993e-01, -1.78770006e-01, -7.10999966e-02,
     -3.94112557e-01, -4.65889990e-01, -5.37670016e-01],
    [ 4.13414478e-01,  5.39032578e-01,  2.51911074e-01,
      3.23689997e-01,  3.95470440e-01,  3.65700014e-02,
      1.86256412e-02, -7.11002126e-02,  7.24600032e-02,
      1.86245721e-02,  4.67247874e-01,  6.79999997e-04,
      2.16020435e-01,  3.65700014e-02,  3.95469993e-01,
      4.49308753e-01,  1.44240007e-01,  3.59579563e-01,
      3.65691446e-02,  5.03139973e-01,  9.04066041e-02,
      2.33965635e-01,  7.24600032e-02, -1.78766012e-01,
      3.95469993e-01,  4.85196590e-01,  3.95471931e-01,
      2.87800014e-01,  2.87800014e-01,  9.04037133e-02,
      3.23689997e-01,  2.16019362e-01,  2.16020852e-01,
      2.16020644e-01,  3.23689133e-01,  2.87800014e-01,
      2.16020003e-01,  2.87800431e-01,  2.51912862e-01,
      3.05747151e-01,  4.07433808e-01,  2.51910001e-01,
      5.39030015e-01,  1.80130005e-01,  1.80130005e-01,
      1.62187681e-01,  2.16014221e-01,  2.51910001e-01,
      2.69854039e-01,  2.87800014e-01,  1.44240007e-01],
    [ 2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  2.52000000e+02,  2.52000000e+02,
      2.52000000e+02,  5.60000000e+01,  5.60000000e+01,
      5.40000000e+01,  5.40000000e+01,  5.20000000e+01,
      5.20000000e+01,  4.90000000e+01,  4.90000000e+01],
    [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
      0.00000000e+00, -4.00000000e+01, -4.00000000e+01,
     -4.10000000e+01, -4.10000000e+01, -4.10000000e+01,
     -4.10000000e+01, -4.10000000e+01, -4.10000000e+01]]])

Could you copy-paste the full traceback? I'm pretty confident that the error is raised here:

X_sax_above_thresh = pipeline.fit_transform(X_window[~idx])

but just to be sure.

Could you try applying BagOfWords to each feature independently and see if an error is raised? I'm still surprised that it works for SAX-VSM because it literally just calls BagOfWords and merges the bags of words for time series that belong to the same class. So it may be an issue with MultivariateTransformer, although it does not perform this check itself.
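
Something like this, for instance (a rough sketch; swap the toy dataset for your own 3D array of shape (n_samples, n_features, n_timestamps)):

from pyts.bag_of_words import BagOfWords
from pyts.datasets import load_basic_motions

# Toy dataset standing in for your data
X, _, _, _ = load_basic_motions(return_X_y=True)

for i in range(X.shape[1]):
    bow = BagOfWords(window_size=20, word_size=3, n_bins=3)
    try:
        bow.fit_transform(X[:, i, :])
    except ValueError as err:
        print(f"feature {i}: {err}")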

I just tried to apply BagOfWords separately on each feature as you suggested and also got the error.

This is the original error message from using the Multivariate Transformer.

Full error message
ValueError                                Traceback (most recent call last)
<timed exec> in <module>

c:\programdata\anaconda3\envs\adas-ad\lib\site-packages\pyts\base.py in fit_transform(self, X, y, **fit_params)
     69         if y is None:
     70             # fit method of arity 1 (unsupervised transformation)
---> 71             return self.fit(X, **fit_params).transform(X)
     72         else:
     73             # fit method of arity 2 (supervised transformation)

c:\programdata\anaconda3\envs\adas-ad\lib\site-packages\pyts\multivariate\transformation\multivariate.py in transform(self, X)
     97 
     98         X_transformed = [transformer.transform(X[:, i, :])
---> 99                          for i, transformer in enumerate(self.estimators_)]
    100         all_sparse = np.all([isinstance(X_transformed_i, csr_matrix)
    101                              for X_transformed_i in X_transformed])

c:\programdata\anaconda3\envs\adas-ad\lib\site-packages\pyts\multivariate\transformation\multivariate.py in <listcomp>(.0)
     97 
     98         X_transformed = [transformer.transform(X[:, i, :])
---> 99                          for i, transformer in enumerate(self.estimators_)]
    100         all_sparse = np.all([isinstance(X_transformed_i, csr_matrix)
    101                              for X_transformed_i in X_transformed])

c:\programdata\anaconda3\envs\adas-ad\lib\site-packages\pyts\bag_of_words\bow.py in transform(self, X)
    348             )
    349         )
--> 350         X_sax_above_thresh = pipeline.fit_transform(X_window[~idx])
    351 
    352         # Concatenate SAX words

c:\programdata\anaconda3\envs\adas-ad\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    385             fit_params_last_step = fit_params_steps[self.steps[-1][0]]
    386             if hasattr(last_step, 'fit_transform'):
--> 387                 return last_step.fit_transform(Xt, y, **fit_params_last_step)
    388             else:
    389                 return last_step.fit(Xt, y,

c:\programdata\anaconda3\envs\adas-ad\lib\site-packages\pyts\base.py in fit_transform(self, X, y, **fit_params)
     35         if y is None:
     36             # fit method of arity 1 (unsupervised transformation)
---> 37             return self.fit(X, **fit_params).transform(X)
     38         else:
     39             # fit method of arity 2 (supervised transformation)

c:\programdata\anaconda3\envs\adas-ad\lib\site-packages\pyts\approximation\sax.py in transform(self, X)
     96             raise_warning=self.raise_warning
     97         )
---> 98         indices = discretizer.fit_transform(X)
     99         if isinstance(alphabet, str):
    100             return indices

c:\programdata\anaconda3\envs\adas-ad\lib\site-packages\pyts\base.py in fit_transform(self, X, y, **fit_params)
     35         if y is None:
     36             # fit method of arity 1 (unsupervised transformation)
---> 37             return self.fit(X, **fit_params).transform(X)
     38         else:
     39             # fit method of arity 2 (supervised transformation)

c:\programdata\anaconda3\envs\adas-ad\lib\site-packages\pyts\preprocessing\discretizer.py in transform(self, X)
    130         n_samples, n_timestamps = X.shape
    131         self._check_params(n_timestamps)
--> 132         self._check_constant(X)
    133 
    134         bin_edges = self._compute_bins(

c:\programdata\anaconda3\envs\adas-ad\lib\site-packages\pyts\preprocessing\discretizer.py in _check_constant(self, X)
    151     def _check_constant(self, X):
    152         if np.any(np.max(X, axis=1) - np.min(X, axis=1) == 0):
--> 153             raise ValueError("At least one sample is constant.")
    154 
    155     def _compute_bins(self, X, n_samples, n_bins, strategy):

ValueError: At least one sample is constant.


If the error occurs on each feature, let's focus on a single feature (univariate case).

If you look at the source code, you can see that:

  • If strategy='normal' (which is the default value), each time series is standardized.
  • All the subsequences are extracted using the window_size and window_step parameters.
  • The subsequences whose standard deviation is below threshold_std are identified.
  • Only the subsequences whose standard deviation is above the threshold are used as input of the pipeline (StandardScaler, PiecewiseAggregateApproximation, SymbolicAggregateApproximation) that raises the error.

A possible explanation would be that, after the first two steps (StandardScaler + PAA), a subsequence whose standard deviation is above the threshold has been transformed into a constant subsequence. However, if a subsequence has a positive standard deviation, it is not constant. Even after the first two steps (StandardScaler + PAA), a non-constant subsequence should not be transformed into a constant subsequence.

Here is a code snippet. Could you try it with a dataset of univariate time series (any feature that raises the error)?

import numpy as np
from pyts.approximation import PiecewiseAggregateApproximation, SymbolicAggregateApproximation
from pyts.bag_of_words import BagOfWords
from pyts.datasets import load_gunpoint
from pyts.preprocessing import StandardScaler
from pyts.utils import windowed_view


# Toy dataset
X, _, _, _ = load_gunpoint(return_X_y=True)

# Parameters
window_size = 20
word_size = 3
n_bins = 3
window_step = 1
threshold_std = 0.01

# Compute the number of windows
n_samples, n_timestamps = X.shape
n_windows = (n_timestamps - window_size + window_step) // window_step

# Standardize each time series independently
X_scaled = StandardScaler().transform(X)

# Extract all the subsequences
X_window = windowed_view(X_scaled, window_size).reshape(n_samples * n_windows, window_size)

# Identify subsequences whose standard deviation is below the threshold
idx = np.std(X_window, axis=1) < threshold_std

# Standardize each subsequence
X_window_above_thresh_scaled = StandardScaler().transform(X_window[~idx])

# Apply PAA to each standardized subsequence
X_window_above_thresh_scaled_paa = PiecewiseAggregateApproximation(
    window_size=None, output_size=word_size
).transform(X_window_above_thresh_scaled)

# Check if a PAA standardized subsequence is constant
if np.any(np.max(X_window_above_thresh_scaled_paa, axis=1) - np.min(X_window_above_thresh_scaled_paa, axis=1) == 0):
    raise ValueError("At least one subsequence is constant.")
    
# Apply SAX
X_window_above_thresh_scaled_paa_sax = SymbolicAggregateApproximation(
    n_bins=n_bins
).transform(X_window_above_thresh_scaled_paa)

For 3 of my features I get the constant sample error and for 2 others I get this error:

176: UserWarning: Some quantiles are equal. The number of bins will be smaller for sample [0]. Consider decreasing the number of bins or removing these samples.
  .format(samples))

Could you help me with interpreting the error? If quantiles are equal doesn't it mean it's constant?
So only 2 of my 7 features throw no errors/warnings.

Which strategy are you using to compute the bins? I thought that you were using strategy='normal' since it's the default value for BagOfWords and SAXVSM (and that's what is used in the original paper describing SAX-VSM). This warning should only be raised if the bin edges are computed with the quantiles of the time series (strategy='quantile').

This warning is raised when you compute the bin edges using the quantiles of the time series and back-to-back quantiles are equal. Imagine that you use 4 bins, thus the 5 bin edges are q_0, q_25, q_50, q_75 and q_100, where q_N is the N-th percentile of the time series. Now imagine that a time series takes a given value many times. This means that you will have two back-to-back bin edges that are equal, and thus no value will be binned in this bin, which means that a symbol will never occur in the word (for instance no "a" or no "b"). In particular, this issue is likely to occur if you have long subsequences in which the time series is constant.
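
A tiny numpy illustration of equal back-to-back quantiles (4 bins, a value repeated many times):

import numpy as np

x = np.array([0., 0., 0., 0., 0., 0., 1., 2., 3., 4.])
bin_edges = np.percentile(x, [0, 25, 50, 75, 100])
print(bin_edges)   # [0. 0. 0. 1.75 4.] -> several back-to-back edges are equal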

I just copy-pasted your code and provided my data as X; since I don't see that parameter explicitly stated, it should take the default value. And in my previous experiments I have not used this parameter, so if strategy='normal' is the default, it should be the one used.

Do you get the error if you try with the 2-sample array I provided?

This warning is raised when you compute the bin edges using the quantiles of the time series and back-to-back quantiles are equal. Imagine that you use 4 bins, thus the 5 bin edges are q_0, q_25, q_50, q_75 and q_100, where q_N is the N-th percentile of the time series. Now imagine that a time series takes a given value many times. This means that you will have two back-to-back bin edges that are equal, and thus no value will be binned in this bin, which means that a symbol will never occur in the word (for instance no "a" or no "b"). In particular, this issue is likely to occur if you have long subsequences in which the time series is constant.

But why is this a problem? As in, all the letters (bins) don't need to appear in one window, right? So you could have something like 'ccb' for a word length of 3 and an alphabet size of 3, for example. Or is this just a message to let the user know? This maybe leads into the additional question I posed at the beginning. Why
'n_bins' must be greater than or equal to 2 and lower than or equal to min(word_size, 26)

Why is having, say, an alphabet size of 5 and a word length of 3 not all right?

My bad, I didn't understand that you ran the code that I provided, I thought that you were still referring to your code. SAX uses strategy='quantile' by default, so it should be changed. But that does not solve the error...

Could you identify which subsequences are constant? I'm still surprised that you can get constant subsequences from non-constant subsequences with the StandardScaler + PAA transformation. With something like this:

# Standardize each subsequence
X_window_scaled = StandardScaler().transform(X_window)

# Apply PAA to each standardized subsequence
X_window_scaled_paa = PiecewiseAggregateApproximation(
    window_size=None, output_size=word_size
).transform(X_window_scaled)

# Identify constant scaled PAA subsequences
idx_constant = np.where(X_window_scaled_paa.std(axis=1) < 1e-30)[0]
print(idx_constant)
print(X_window[idx_constant].std(axis=1))

Do you get the error if you try with the 2-sample array I provided?

I don't get the error. Do you get it?

But why is this a problem? As in, all the letters (bins) don't need to appear in one window, right? So you could have something like 'ccb' for a word length of 3 and an alphabet size of 3, for example. Or is this just a message to let the user know? This maybe leads into the additional question I posed at the beginning. Why
'n_bins' must be greater than or equal to 2 and lower than or equal to min(word_size, 26)? Why is having, say, an alphabet size of 5 and a word length of 3 not all right?

Usually you don't want to have more bins than the number of unique values because you would have some bins that are not used. The objective of binning is to summarize information in a smaller space (you don't look at the exact value, but only in which range the value falls). If you have more bins than the number of values, you may amplify the noise in the signal. Finally, the number of possible words is alphabet_size ** word_size, so it increases exponentially and you don't want the alphabet size to be too large compared to the word size.

It was shown in a study (I think the one introducing SAX) that the histogram of a standardized time series is similar to the standard normal distribution, which means that, if you use the quantiles of the standard normal distribution, you should have (in expectation) roughly the same number of values falling inside each bin, meaning that, in a word, each symbol should have roughly the same frequency. This is why the quantiles of the standard normal distribution are usually used in SAX.

So you actually expect each symbol to occur roughly the same number of times in each word. What you are looking at is the ordering of the symbols.

Alright, reproducible code! (Hopefully!)

import numpy as np
from pyts.bag_of_words import BagOfWords
bow = BagOfWords(window_size=20, word_size=3, n_bins=3, threshold_std=0.01)

X = np.array([[-2.        , -2.29999995, -2.29999995, -2.4000001 , -2.5       ,
        -2.5       , -2.5       , -2.5999999 , -2.5       , -2.5       ,
        -2.5       , -2.5       , -2.5       , -2.5       , -2.5       ,
        -2.5       , -2.5       , -2.5       , -2.5       , -2.5       ,
        -2.5       , -2.5       , -2.5       , -2.5       , -2.5       ,
        -2.5       , -2.5       , -2.5       , -2.5       , -2.5       ,
        -2.5       , -2.5       , -2.5       , -2.5       , -2.5       ,
        -2.5       , -2.5       , -2.5       , -2.5       , -2.5       ,
        -2.5       , -2.5       , -2.5       , -2.5       , -2.5       ,
        -2.5       , -2.79999995, -2.9000001 , -3.29999995, -3.4000001 ,
        -3.9000001 ]])

bow.fit_transform(X)

This gives me the constant time series error

Thank you for the second point about the bins. I understand that some bins might not be used if the alphabet size is greater than the word length, but my thinking was that this could indicate a steeper change in the signal and be an interesting pattern. Is this wrong? In one of the SAX papers (I don't recall which one off the top of my head) they seemed to recommend an alphabet size of 4-6, saying that it's alright for most applications, but then by what you said it implies that the word size has to be at least that as well?

Additionally, the 'saxpy' implementation allows for alphabet size to be larger than word length. I had tried 5 letters and word length of 3 for my classification problem. So, I then got words like dcc, ebb, and eca, which were also the top features considered by a RF classifier. So then you're saying that increasing the word size, or decreasing the alphabet size would be more meaningful? When I saw that ebb and eca are considered important, it confirmed my own bias about the 'quickly changing patterns' as those are not adjacent bins.

Thank you very much for the reproducible code!

A possible explanation would be that, after the first two steps (StandardScaler + PAA), a subsequence whose standard deviation is above the threshold has been transformed into a constant subsequence. However, if a subsequence has a positive standard deviation, it is not constant. Even after the first two steps (StandardScaler + PAA), a non-constant subsequence should not be transformed into a constant subsequence.

My intuition was right but my conclusion was wrong: a non-constant subsequence can be transformed into a constant subsequence after using PAA:

# Before PAA
[ 3.16227766,  0.        ,  0.        ,  0.        , -3.16227766,
  0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
  0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
  0.        ,  0.        ,  0.        ,  0.        ,  0.        ]
 
# After PAA with word_size=3
[ 0.        ,  0.        ,  0.        ]

I should probably get rid of this constant sample check because it's a bit of an edge case that leads to too many issues...

Additionally, the 'saxpy' implementation allows for alphabet size to be larger than word length. I had tried 5 letters and word length of 3 for my classification problem. So, I then got words like dcc, ebb, and eca, which were also the top features considered by a RF classifier. So then you're saying that increasing the word size, or decreasing the alphabet size would be more meaningful? When I saw that ebb and eca are considered important, it confirmed my own bias about the 'quickly changing patterns' as those are not adjacent bins.

I see what you mean. Since the standardization is performed on the whole subsequence (before PAA and not after), having a number of bins larger than the word size is not an issue (but I guess it's better to have a number of bins not larger than the window size so that each bin has at least one value from the subsequence). Applying PAA indicates in which bin the mean value falls, which could indicate for instance whether there's a plateau around the minimum (which may lead to an 'a' symbol) or whether the minimum is a sharp peak with larger values around it (which may lead to a 'b' symbol). This restriction on the number of symbols is not needed (and may be bad since it may lead to underfitting!).

I will try to fix this tomorrow (which should consist in applying the same fix that I added to BagOfWords to KBinsDiscretizer and removing the _check_constant method). Thank you for your insight, it really helps me improve the package!

The fix (#110) has been merged into the main branch. I hope that it fixes everything this time :)

Thanks @johannfaouzi - I came across this error using 0.11 and installed from main (ff746ef) and it's working for me!

Yes, thank you, this seems to work for me as well :)

What would then be the syntax to use the multivariate transformer with SAX-VSM?
I tried the following, which unsurprisingly didn't work, as calling both the SAX-VSM and then BOW doesn't make sense.
Or would I then still do the rest of the steps with my own code? (tf-idf transformation and cosine similarity classification)

from pyts.multivariate.transformation import MultivariateTransformer
from pyts.bag_of_words import BagOfWords
from pyts.classification import SAXVSM
from sklearn.pipeline import make_pipeline

bow = BagOfWords(window_size=20, word_size=3, n_bins=5)
transformer = MultivariateTransformer(bow, flatten=True)
sax_vsm = SAXVSM(window_size=20, word_size=3, n_bins=5, strategy='normal')
clf = make_pipeline(transformer, sax_vsm)
clf.fit(X, y)

I also noticed that just this

transformer = MultivariateTransformer(BagOfWords(window_size=20, word_size=3, n_bins=5), flatten=True)
new = transformer.fit_transform(X)

gives me slightly different results compared to using sax_via_window from saxpy with the same parameters,

saxw = sax_via_window(signal, win_size=20, paa_size=3, alphabet_size=5,
                             nr_strategy='exact',  z_threshold=0.01)

Namely, from pyts I get more words than from saxpy; all the words from saxpy form a subset of the results from pyts, so there is no mismatch (the frequencies are also the same), I just get a few more from pyts. Is there something that comes to mind that could explain this?

What would then be the syntax to use the multivariate transformer with SAX-VSM?
I tried the following, which unsurprisingly didn't work, as calling both the SAX-VSM and then BOW doesn't make sense.
Or would I then still do the rest of the steps with my own code? (tf-idf transformation and cosine similarity classification)

SAXVSM is implemented as a classifier and works on raw time series, so unfortunately, if you want to perform the remaining steps, you would have to do it with your own code (feel free to have a look at the source code of SAXVSM). How do you plan to work on the multivariate case (not only for TF but also IDF)? Do you consider different features as different documents? I don't mind helping you a bit with the remaining code, but I would need more information on how you want to handle the multivariate case.

Since SAXVSM is implemented as a classifier, it can be used with pyts.multivariate.classification.MultivariateClassifier to make it work for the multivariate case using majority voting (by fitting an independent classifier on each feature). So you have one TF-IDF matrix for each feature, you compute the cosine similarity scores for each class to obtain the prediction for each feature, then majority voting is performed to output a single prediction:

from pyts.multivariate.classification import MultivariateClassifier
from pyts.classification import SAXVSM
from pyts.datasets import load_basic_motions

X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)

clf = MultivariateClassifier(
    SAXVSM(window_size=20, word_size=6, n_bins=6, strategy='normal'),
)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

I also noticed that just this gives me slightly different results compared to using sax_via_window from saxpy with the same parameters. Namely, from pyts I get more words than from saxpy, all the words from saxpy form a subset from the results from pyts so there is no mismatch (frequencies also are the same) I just get a few more from pyts. Is there something that comes to mind that could explain this?

I tested it out and also noticed this, but didn't notice this when I set a very low value to the threshold parameter (like 1e-50). So I looked at the source code of the standardization function in saxpy, znorm, and noticed that they use the covariance (to deal with multivariate time series), so for univariate time series, the threshold used is actually the variance and not the standard deviation. Also, they use the "biased" formula for the variance (the denominator is N instead of N-1), while the unbiased formula is used in pyts (but it should be negligible). Could you try setting znorm_threshold=threshold_std ** 2 and see if you get the same results?

What I do presently is almost what the BOW transformation returns
From pyts BOW:
image

and I have it like this:
image

so I do the feature counts for each pattern, keeping track of which signal they came from with a suffix. Then I do the tf-idf transformation, considering each sample/instance as a document. But depending on what you do, that's not necessary. Once you have that matrix you can use any sort of standard ML classifier on it, or do the SAX-VSM by creating the two vectors for each class. I was interested in SAX-VSM as I'm not sure that I implemented the matrix containing the vectors for each class correctly: I just summed together everything belonging to one class and then did the tf-idf.

I don't know how you compute the word counts and the TF-IDF matrix but I would suggest using the tools made available in scikit-learn: sklearn.feature_extraction.text.TfidfVectorizer and sklearn.feature_extraction.text.CountVectorizer.

SAX-VSM considers that each class is a document, not that each sample is a document:

Bags are combined into a corpus, which is built as a term frequency matrix, whose rows correspond to the set of all SAX words (terms) found in all classes, whereas each column denotes a class of the training set. Each element of this matrix is an observed frequency of a term in a class.

Capture d’écran 2021-09-01 à 21 19 39

So you need to merge the bag of words for all the samples that belong to the same class (or equivalently sum the histograms, i.e. rows, of your matrix for all the samples that belong to the same class). The first approach is currently used in pyts:

X_class = [' '.join(X_bow[y_ind == classe])
           for classe in range(n_classes)]
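As a quick sanity check that merging the bags of words per class and summing the per-sample histograms are equivalent, here is a minimal sketch with made-up bags of words (using CountVectorizer from scikit-learn):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up bags of words for four samples belonging to two classes
X_bow = np.array(['aab abb aab', 'abb abb', 'bba bab', 'bab bba bba'])
y_ind = np.array([0, 0, 1, 1])
n_classes = 2

# Approach 1: merge the bags of words of the samples belonging to the same class
X_class = [' '.join(X_bow[y_ind == classe]) for classe in range(n_classes)]
count = CountVectorizer()
counts_merged = count.fit_transform(X_class).toarray()

# Approach 2: compute one histogram per sample, then sum the histograms per class
counts_per_sample = count.transform(X_bow).toarray()
counts_summed = np.array([counts_per_sample[y_ind == classe].sum(axis=0)
                          for classe in range(n_classes)])

assert np.array_equal(counts_merged, counts_summed)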

Here is an example of an implementation of the algorithm matching your description to handle the multivariate case:

import numpy as np
import pandas as pd
from pyts.bag_of_words import BagOfWords
from pyts.datasets import load_basic_motions
from sklearn.base import clone
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)


###################
# T R A I N I N G #
###################

le = LabelEncoder()
y_train_ind = le.fit_transform(y_train)

n_features = X_train.shape[1]

bow_list = [clone(BagOfWords(window_size=20, word_size=4, n_bins=4, strategy='normal'))
            for _ in range(n_features)]

# Create an empty list that will be transformed into a DataFrame with the TF-IDF matrix
df_tfidf = []

# Create a list to save all the instances of TfidfVectorizer
tfidf_list = []

for i, bow in enumerate(bow_list):

    # Get the BoW transformation for this feature
    X_bow = bow.transform(X_train[:, i])
    
    # Group bags of words for samples belonging to the same class
    X_class = [' '.join(X_bow[y_train_ind == classe])
                   for classe in range(le.classes_.size)]

    # Create an instance of TfidfVectorizer
    tfidf = TfidfVectorizer(
        norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True
    )
    
    # Get TF-IDF matrix
    tfidf_matrix = tfidf.fit_transform(X_class).A
    
    tfidf_list.append(tfidf)
    
    # Get dictionary for this feature
    dictionary = {value: key for key, value in tfidf.vocabulary_.items()}
    dictionary = [word + f'_{i + 1}'
                  for word in np.vectorize(dictionary.get)(np.arange(tfidf_matrix.shape[1]))]

    df_tfidf.append(pd.DataFrame(data=tfidf_matrix, columns=dictionary, index=le.classes_))

# Concatenate each TF-IDF matrix
df_tfidf = pd.concat(df_tfidf, axis=1)


#####################
# I N F E R E N C E #
#####################

# Create an empty list that will be transformed into a DataFrame with the
# histograms for all the test time series
df_test = []

for i, (bow, tfidf) in enumerate(zip(bow_list, tfidf_list)):

    # Get the BoW transformation for this feature
    X_bow = bow.transform(X_test[:, i])
    
    # Create an instance of CountVectorizer
    vectorizer = CountVectorizer(vocabulary=tfidf.vocabulary_)
    
    # Get the histogram (word count) for each test time series
    X_histogram = vectorizer.transform(X_bow).A

    df_test.append(pd.DataFrame(data=X_histogram))
    
# Concatenate each histogram
df_test = pd.concat(df_test, axis=1)
df_test.columns = df_tfidf.columns

# Compute cosine similarity scores
cosine_similarity_scores = cosine_similarity(df_test, df_tfidf)

# Return the class with the highest cosine similarity
y_pred = le.classes_[cosine_similarity_scores.argmax(axis=1)]

# Compute the accuracy score
accuracy_score(y_test, y_pred)

Thanks again for your help!

So you need to merge the bag of words for all the samples that belong to the same class (or equivalently sum the histograms, i.e. rows, of your matrix for all the samples that belong to the same class). The first approach is currently used in pyts:

Ahh, yes, thank you for this. I was summing up the frequencies and I was wondering if that was all right. I see now that the pyts implementation is much more efficient/faster than my own attempts with saxpy, so I'm very grateful for your responsiveness and support!

And if I want to just have one huge BOW for each sample representing the tf-idf scores, so (n_samples, bow_all_features), would this approach be ok? If everything's correct, I'm trying to get the tf-idf scores for each feature for all the samples and then concatenate the features together. It seems to be working, but I just wanted to run it by you just in case.

import numpy as np
import pandas as pd
from pyts.multivariate.transformation import MultivariateTransformer
from pyts.bag_of_words import BagOfWords
from sklearn.feature_extraction.text import TfidfVectorizer

# Same transformer as before
transformer = MultivariateTransformer(BagOfWords(window_size=20, word_size=3, n_bins=5), flatten=True)
X_new = transformer.fit_transform(X_train_val)

features_tfidf = []
for i in range(X_new.shape[1]):
    tfidf = TfidfVectorizer(
        norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True
    )
    feature = X_new[:, i]
    tfidf_matrix = tfidf.fit_transform(feature).A
    
    dictionary = {value: key for key, value in tfidf.vocabulary_.items()}
    pattern_names = [word + f'_{i}'
                  for word in np.vectorize(dictionary.get)(np.arange(tfidf_matrix.shape[1]))]
    
    df_idf = pd.DataFrame(tfidf_matrix, columns=pattern_names)
    features_tfidf.append(df_idf)
    
dfx_idf = pd.concat(features_tfidf, axis=1)

I want to eventually set up cross-validation and grid search for the word length and bin count of the BOW transformation. I know how it can be done relatively easily with scikit-learn ML models. I suspect that it would be trickier with the SAX-VSM implementation you shared. But if I had something like data > BOW from pyts > tf-idf transformation (like in my code snippet) > ML classifier from scikit-learn, could it work with a pipeline from scikit-learn? I don't have much experience with it, but conceptually it seems to fit the case.

I think I'm close to getting my grid search working; the main issue I'm having is that, when using the RF classifier, each split needs to have the same number of features. Since BOW only returns the patterns it actually found, and not all possible words, the pipeline fails when new features appear that the classifier isn't expecting.

To circumvent this I tried to infer the word size and number of bins from the output of BOW. Word size is easy to infer, and I thought the number of bins would be too. My idea was to create a feature for every possible word, to get a consistent input for the classifier, and simply set the ones that are not found to 0. This still didn't work: it was complaining about the number of features that the classifier is getting versus what it is actually expecting. Do you know of any way I could make this kind of pipeline feasible?

Code:

from pyts.multivariate.transformation import MultivariateTransformer
from pyts.bag_of_words import BagOfWords
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import itertools
import numpy as np
import pandas as pd


def bow_to_matrix(X):  
    features_tfidf = []
    for i in range(X.shape[1]):
        tfidf = TfidfVectorizer(
            norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True
        )
        feature = X[:, i]
        tfidf_matrix = tfidf.fit_transform(feature).A

        dictionary = {value: key for key, value in tfidf.vocabulary_.items()}
        pattern_names = [word + f'_{i}'
                      for word in np.vectorize(dictionary.get)(np.arange(tfidf_matrix.shape[1]))]

        df_idf = pd.DataFrame(tfidf_matrix, columns=pattern_names)
        
        ## trying to infer number of  bins and word size from the transformation
        unique_patterns_found = [word for word in np.vectorize(dictionary.get)(np.arange(tfidf_matrix.shape[1]))]
        word_size = len(unique_patterns_found[0])
        alphabet_size = len(set(''.join(unique_patterns_found)))
        
        alphabet = bytearray(range(97,123)).decode("utf-8")
        alphabet = alphabet[:alphabet_size]

        alpha = [x for x in alphabet]
        bow_all = [p for p in itertools.product(alpha, repeat=word_lenght)]
        bow_all = [''.join(x) for x in bow_all]
        
        patterns_all = [f'{x}_{i}' for x in bow_all]
        patterns_not_found = [x for x in patterns_all if x not in pattern_names]        
        
        df_idf[patterns_not_found] = 0
        df_idf = df_idf[sorted(df_idf.columns)]

             
        features_tfidf.append(df_idf)
  
    
    dfx_idf = pd.concat(features_tfidf, axis=1)

    return dfx_idf

transformer = MultivariateTransformer(BagOfWords(), flatten=True)
tf_idf_transformer = FunctionTransformer(func=bow_to_matrix)
rf = RandomForestClassifier()

steps = [('SAX_transform', transformer), ('tf_idf_transform', tf_idf_transformer), ('rf', rf)]
pipeline = Pipeline(steps)

parameter_grid = [{'SAX_transform__estimator__word_size': [3, 4, 5, 6],
                   'SAX_transform__estimator__n_bins': [3, 4, 5, 6],
                   'SAX_transform__estimator__window_size': [12, 20]}]

grid_search = GridSearchCV(pipeline, parameter_grid, cv=5, scoring='recall', verbose=3)

grid_search.fit(X_train_val, y_train_val)

And if I want to just have one huge BOW for each sample representing the tf-idf scores, so (n_samples, bow_all_features), would this approach be ok? If everything's correct, I'm trying to get the tf-idf scores for each feature for all the samples and then concatenate the features together. It seems to be working, but I just wanted to run it by you just in case.

Yeah, there's nothing wrong about this approach.

I want to eventually set up cross-validation and grid search for the word length and bin count of the BOW transformation. I know how it can be done relatively easily with scikit-learn ML models. I suspect that it would be trickier with the SAX-VSM implementation you shared. But if I had something like data > BOW from pyts > tf-idf transformation (like in my code snippet) > ML classifier from scikit-learn, could it work with a pipeline from scikit-learn? I don't have much experience with it, but conceptually it seems to fit the case.

Yes, it is definitely possible and a great idea. Pipeline tools from scikit-learn are awesome and can prevent a lot of issues when doing "hand-made" cross-validation with a lot of steps.

A pipeline is a chain of estimators such that:

  • All but the last estimator is a transformer (must have fit, transform and fit_transform methods).
  • The last estimator can be either a transformer or a predictor (in your case a classifier).

When you call fit on your pipeline:

  • All but the last estimator calls fit_transform.
  • The last estimator calls fit.

When you call predict on your pipeline:

  • All but the last estimator calls transform.
  • The last estimator calls predict.

So you need to create a transformer. I find it more simple to create a new class that incorporates both BOW from pyts > tf-idf transformation by using the code that I provided. One issue with your current implementation is that, in bow_to_matrix, you always call fit_transform in TfidfVectorizer, but you don't want to learn the IDF terms from your test data, only from your training data! Using fit or fit_transform in the inference phase is forbidden.
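To make the fit/transform rule concrete, here is a minimal sketch with hypothetical bags of words:

from sklearn.feature_extraction.text import TfidfVectorizer

X_bow_train = ['aab abb abb', 'bba bba bab']  # hypothetical training bags of words
X_bow_test = ['abb bba']                      # hypothetical test bag of words

tfidf = TfidfVectorizer(norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True)

# Training phase: learn the vocabulary and the IDF terms
X_tfidf_train = tfidf.fit_transform(X_bow_train).A

# Inference phase: only transform, reusing the vocabulary and IDF terms learned above;
# calling fit_transform here would relearn a (possibly different) vocabulary from the
# test data, which is exactly what must be avoided
X_tfidf_test = tfidf.transform(X_bow_test).A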

To circumvent this I tried to infer the word size and number of bins from the output of BOW. Word size is easy to infer, and I thought the number of bins would be too. My idea was to create a feature for every possible word, to get a consistent input for the classifier, and simply set the ones that are not found to 0. This still didn't work: it was complaining about the number of features that the classifier is getting versus what it is actually expecting. Do you know of any way I could make this kind of pipeline feasible?

You can provide the vocabulary to TfidfVectorizer using the vocabulary parameter. In your case, you want the vocabulary to have all the possible words. The number of possible words is n_bins ** word_size, which may be high depending on the values of n_bins and word_size. A downside is that it might lead to large (sparse) arrays, which requires a lot of RAM. A possible solution would be to use scipy.sparse.csr_matrix but not many scikit-learn classifiers are compatible with this (logistic regression is). Let me know if this is an issue, I could easily change the code below.
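Concretely, building the full vocabulary and passing it to TfidfVectorizer could look like this (a small sketch assuming word_size=4 and n_bins=4; it is essentially what the class below does):

from itertools import product
from sklearn.feature_extraction.text import TfidfVectorizer

word_size, n_bins = 4, 4
alphabet = [chr(i) for i in range(97, 97 + n_bins)]  # ['a', 'b', 'c', 'd']
vocabulary = [''.join(word) for word in product(alphabet, repeat=word_size)]
print(len(vocabulary))  # n_bins ** word_size = 256 possible words

tfidf = TfidfVectorizer(vocabulary=vocabulary, norm=None, use_idf=True,
                        smooth_idf=True, sublinear_tf=True)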

Here is an example in which I create a BagOfWordsTfidfTransformer class that performs both BOW transformation and TF-IDF transformation (you can add more parameters to the class if you want to change other parameters of BagOfWords and TfidfVectorizer):

from itertools import product
import numpy as np
from pyts.bag_of_words import BagOfWords
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer


class BagOfWordsTfidfTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, window_size=20, word_size=4, n_bins=4, strategy='normal',
                 norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True):
        self.window_size = window_size
        self.word_size = word_size
        self.n_bins = n_bins
        self.strategy = strategy
        self.norm = norm
        self.use_idf = use_idf
        self.smooth_idf = smooth_idf
        self.sublinear_tf = sublinear_tf
        
    def fit(self, X, y=None):
        """Fit the transformer.
        
        Parameters
        ----------
        X : array, shape = (n_samples, n_features, n_timestamps)
            Dataset of multivariate time series
            
        y : array, shape = (n_samples,)
            Target values (ignored)
            
        Returns
        -------
        self : object
        """
        # The target values are not needed to fit this transformer
        n_features = X.shape[1]
        
        # Create a BagOfWords instance for each feature
        bow_list = [
            clone(BagOfWords(window_size=self.window_size, word_size=self.word_size,
                             n_bins=self.n_bins, strategy=self.strategy, alphabet=None))
            for _ in range(n_features)
        ]
        
        # Compute the vocabulary for each feature
        alphabet = np.array([chr(i) for i in range(97, 97 + self.n_bins)])
        vocabulary = [''.join(word) for word in product(alphabet, repeat=self.word_size)]

        # Create a list to save all the instances of TfidfVectorizer
        tfidf_list = []

        for i, bow in enumerate(bow_list):

            # Get the BoW transformation for this feature
            X_bow = bow.transform(X[:, i])

            # Create an instance of TfidfVectorizer
            tfidf = TfidfVectorizer(
                norm=self.norm, use_idf=self.use_idf,
                smooth_idf=self.smooth_idf, sublinear_tf=self.sublinear_tf,
                vocabulary=vocabulary
            )

            # Fit the instance
            tfidf.fit(X_bow)
            tfidf_list.append(tfidf)
            
        # Save the instances
        self._bow_list = bow_list
        self._tfidf_list = tfidf_list
        
        # Save the vocabulary
        vocabulary_list = [
            [''.join(word) + f'_{i}' for word in vocabulary]
            for i in range(n_features)
        ]
        self.vocabulary_ = np.concatenate(vocabulary_list)

        return self
    
    def fit_transform(self, X, y=None):
        """Fit the transformer and return the transformation.
        
        Parameters
        ----------
        X : array, shape = (n_samples, n_features, n_timestamps)
            Dataset of multivariate time series
            
        y : array, shape = (n_samples,)
            Target values (ignored)
            
        Returns
        -------
        X_new : array, shape = (n_samples, n_words)
            TF-IDF matrix.
    
        """
        # The target values are not needed to fit this transformer
        n_features = X.shape[1]
        
        # Create a BagOfWords instance for each feature
        bow_list = [
            clone(BagOfWords(window_size=self.window_size, word_size=self.word_size,
                             n_bins=self.n_bins, strategy=self.strategy))
            for _ in range(n_features)
        ]
        
        # Compute the vocabulary for each feature
        alphabet = np.array([chr(i) for i in range(97, 97 + self.n_bins)])
        vocabulary = [''.join(word) for word in product(alphabet, repeat=self.word_size)]
    
        # Create a list to save all the TF-IDF matrices
        tfidf_matrix = []

        # Create a list to save all the instances of TfidfVectorizer
        tfidf_list = []

        for i, bow in enumerate(bow_list):

            # Get the BoW transformation for this feature
            X_bow = bow.transform(X[:, i])

            # Create an instance of TfidfVectorizer
            tfidf = TfidfVectorizer(
                norm=self.norm, use_idf=self.use_idf,
                smooth_idf=self.smooth_idf, sublinear_tf=self.sublinear_tf,
                vocabulary=vocabulary
            )

            # Fit the instance and transform the input
            tfidf_matrix.append(tfidf.fit_transform(X_bow).A)
            tfidf_list.append(tfidf)
            
        # Save the instances
        self._bow_list = bow_list
        self._tfidf_list = tfidf_list
        
        # Save the vocabulary
        vocabulary_list = [
            [''.join(word) + f'_{i}' for word in vocabulary]
            for i in range(n_features)
        ]
        self.vocabulary_ = np.concatenate(vocabulary_list)

        # Return the TF-IDF matrix
        return np.concatenate(tfidf_matrix, axis=1)
    
    def transform(self, X, y=None):
        """Return the transformation.
        
        Parameters
        ----------
        X : array, shape = (n_samples, n_features, n_timestamps)
            Dataset of multivariate time series
            
        y : array, shape = (n_samples,)
            Target values (ignored)
            
        Returns
        -------
        X_new : array, shape = (n_samples, n_words)
            TF-IDF matrix.
    
        """
        # Create a list to save all the TF-IDF matrices
        tfidf_matrix = []

        for i, (bow, tfidf) in enumerate(zip(self._bow_list, self._tfidf_list)):

            # Get the BoW transformation for this feature
            X_bow = bow.transform(X[:, i])

            # Transform the BoW into a TF-IDF matrix
            tfidf_matrix.append(tfidf.transform(X_bow).A)

        # Return the TF-IDF matrix
        return np.concatenate(tfidf_matrix, axis=1)

This class can be used with scikit-learn tools such as pipelines and grid search with cross-validation:

from pyts.datasets import load_basic_motions
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier


# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)

# Define instances
transformer = BagOfWordsTfidfTransformer(word_size=6)
rfc = RandomForestClassifier(random_state=42)

# Create the pipeline
pipe = Pipeline(steps=[('bowtfidf', transformer), ('rfc', rfc)])
param_grid = {'bowtfidf__window_size': [16, 20, 24], 'bowtfidf__n_bins': [3, 4, 5],
              'rfc__criterion': ['gini', 'entropy']}

# Create the grid search with cross-validation
clf = GridSearchCV(pipe, param_grid, cv=5)

# Fit the instance (grid search with cross-validation on the training set)
clf.fit(X_train, y_train)

# Evaluate the performance on the test set
clf.score(X_test, y_test)

You may add sklearn.feature_selection.VarianceThreshold in the pipeline to remove features (words) that are constant (zero variance), which should decrease the number of features and speed up fitting the classifier. But it does not change the maximum memory usage because it would only be done after the transformation.
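For instance, the pipeline above could be extended like this (a sketch reusing the BagOfWordsTfidfTransformer class defined above):

from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline(steps=[
    ('bowtfidf', BagOfWordsTfidfTransformer(word_size=6)),
    ('variance', VarianceThreshold(threshold=0.0)),  # drop constant (zero-variance) words
    ('rfc', RandomForestClassifier(random_state=42)),
])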

Actually the issue only came from the fact that you were calling fit_transform during the inference phase, which led to a different dictionary. When calling only transform during the inference phase, this issue disappears, so there is no need to compute all the possible words.

from itertools import product
import numpy as np
from pyts.bag_of_words import BagOfWords
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer


class BagOfWordsTfidfTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, window_size=20, word_size=4, n_bins=4, strategy='normal',
                 norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True):
        self.window_size = window_size
        self.word_size = word_size
        self.n_bins = n_bins
        self.strategy = strategy
        self.norm = norm
        self.use_idf = use_idf
        self.smooth_idf = smooth_idf
        self.sublinear_tf = sublinear_tf
        
    def fit(self, X, y=None):
        """Fit the transformer.
        
        Parameters
        ----------
        X : array, shape = (n_samples, n_features, n_timestamps)
            Dataset of multivariate time series
            
        y : array, shape = (n_samples,)
            Target values (ignored)
            
        Returns
        -------
        self : object
        """
        # The target values are not needed to fit this transformer
        n_features = X.shape[1]
        
        # Create a BagOfWords instance for each feature
        bow_list = [
            clone(BagOfWords(window_size=self.window_size, word_size=self.word_size,
                             n_bins=self.n_bins, strategy=self.strategy, alphabet=None))
            for _ in range(n_features)
        ]

        # Create a list to save all the instances of TfidfVectorizer
        tfidf_list = []

        for i, bow in enumerate(bow_list):

            # Get the BoW transformation for this feature
            X_bow = bow.transform(X[:, i])

            # Create an instance of TfidfVectorizer
            tfidf = TfidfVectorizer(
                norm=self.norm, use_idf=self.use_idf,
                smooth_idf=self.smooth_idf, sublinear_tf=self.sublinear_tf,
            )

            # Fit the instance
            tfidf.fit(X_bow)
            tfidf_list.append(tfidf)
            
        # Save the instances
        self._bow_list = bow_list
        self._tfidf_list = tfidf_list
        
        # Save the vocabulary
        vocabulary_list = [
            [word + f'_{i}' for word in np.vectorize(
                {value: key for key, value in tfidf.vocabulary_.items()}.get
            )(np.arange(len(tfidf.vocabulary_)))]
            for i, tfidf in enumerate(tfidf_list)
        ]
        self.vocabulary_ = np.concatenate(vocabulary_list)

        return self
    
    def fit_transform(self, X, y=None):
        """Fit the transformer and return the transformation.
        
        Parameters
        ----------
        X : array, shape = (n_samples, n_features, n_timestamps)
            Dataset of multivariate time series
            
        y : array, shape = (n_samples,)
            Target values (ignored)
            
        Returns
        -------
        X_new : array, shape = (n_samples, n_words)
            TF-IDF matrix.
    
        """
        # The target values are not needed to fit this transformer
        n_features = X.shape[1]
        
        # Create a BagOfWords instance for each feature
        bow_list = [
            clone(BagOfWords(window_size=self.window_size, word_size=self.word_size,
                             n_bins=self.n_bins, strategy=self.strategy))
            for _ in range(n_features)
        ]
    
        # Create a list to save all the TF-IDF matrices
        tfidf_matrix = []

        # Create a list to save all the instances of TfidfVectorizer
        tfidf_list = []

        for i, bow in enumerate(bow_list):

            # Get the BoW transformation for this feature
            X_bow = bow.transform(X[:, i])

            # Create an instance of TfidfVectorizer
            tfidf = TfidfVectorizer(
                norm=self.norm, use_idf=self.use_idf,
                smooth_idf=self.smooth_idf, sublinear_tf=self.sublinear_tf,
            )

            # Fit the instance and transform the input
            tfidf_matrix.append(tfidf.fit_transform(X_bow).A)
            tfidf_list.append(tfidf)
            
        # Save the instances
        self._bow_list = bow_list
        self._tfidf_list = tfidf_list
        
        # Save the vocabulary
        vocabulary_list = [
            [word + f'_{i}' for word in np.vectorize(
                {value: key for key, value in tfidf.vocabulary_.items()}.get
            )(np.arange(len(tfidf.vocabulary_)))]
            for i, tfidf in enumerate(tfidf_list)
        ]
        self.vocabulary_ = np.concatenate(vocabulary_list)

        # Return the TF-IDF matrix
        return np.concatenate(tfidf_matrix, axis=1)
    
    def transform(self, X, y=None):
        """Return the transformation.
        
        Parameters
        ----------
        X : array, shape = (n_samples, n_features, n_timestamps)
            Dataset of multivariate time series
            
        y : array, shape = (n_samples,)
            Target values (ignored)
            
        Returns
        -------
        X_new : array, shape = (n_samples, n_words)
            TF-IDF matrix.
    
        """
        # Create a list to save all the TF-IDF matrices
        tfidf_matrix = []

        for i, (bow, tfidf) in enumerate(zip(self._bow_list, self._tfidf_list)):

            # Get the BoW transformation for this feature
            X_bow = bow.transform(X[:, i])

            # Transform the BoW into a TF-IDF matrix
            tfidf_matrix.append(tfidf.transform(X_bow).A)

        # Return the TF-IDF matrix
        return np.concatenate(tfidf_matrix, axis=1)

This time a bit of a delay before my next enormous thank you! :)
I spent some time playing around with this, and have some comments, observations, questions:

  • The code I posted actually worked as well; the error I had was due to a typo (I had mixed up word_size and word_lenght), though I still would have had the issue with fit_transform that you mentioned. Again, thank you for taking the time to elaborate on this, it is very instructive for me!
  • I managed to run grid search with the class you provided and in my use case the best parameters were (window_size=12, n_bins=6, word_size=5), so this is actually the case where all bins might not be represented in a word!
  • It took 20 hours to run, with a significant increase in time as word_size increased beyond 4, ranging from 1.6 to 37 minutes per fit.
  • With n_bins >=5 and word_size == 6 (the max I tried) all the CV runs failed due to memory allocation errors like this one:
    numpy.core._exceptions._ArrayMemoryError: Unable to allocate 39.0 GiB for an array with shape (47896, 109375) and data type float64. My data is only around 0.2 GB, so I don't know where this is coming from. The first dimension does correspond to the samples in the CV fold, but the second dimension doesn't make sense to me.
  • When I had gotten the best parameters from the grid search, I tried using the class you provided just to transform my data with these parameters, and again got a memory error when calling fit_transform on my training data: MemoryError: Unable to allocate 24.3 GiB for an array with shape (59871, 54432) and data type float64. I can do this transformation without issues with the methods I tried before, i.e. manually calling MultivariateTransformer and BagOfWords. What could be causing this? So now I can't fit this instance separately even though it ran within the grid search (I am using a bit more data now since it's not CV, but in total it's still only 0.2 GB).
  • If I do the manual transformation with MultivariateTransformer and BagOfWords with the optimal parameters (so not using the class you provided), the results are worse than within the grid search. I suspect it is because of the absence of all the features that I generate even when they are not present (still setting them to 0, obviously), which may be throwing the classifier off? These results fall outside the cross-validation standard deviation of the performance scores.

EDIT. I managed to call the best model from grid_search with the .best_estimator_ attribute. And now it seems to fit without the memory error, but it still takes much longer than doing the steps manually as outlined in my last point.

  • It took 20 hours to run, with a significant increase in time as word_size increased beyond 4, ranging from 1.6 to 37 minutes per fit.

It's probably due to the huge increase in the number of features which makes fitting the classifier much longer. Which classifier are you using? Random forest? The algorithmic complexity of the provided BagOfWordsTfidfTransformer does not vary much with different values of word_size and n_bins (slightly increases but not that much). Here is an example below that I ran on my laptop:

from pyts.datasets import load_basic_motions

# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)

%%timeit 
transformer = BagOfWordsTfidfTransformer(n_bins=3, word_size=3)
transformer.fit_transform(X_train, y_train)
# 58.8 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit 
transformer = BagOfWordsTfidfTransformer(n_bins=6, word_size=5)
transformer.fit_transform(X_train, y_train)
# 85.5 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

  • With n_bins >=5 and word_size == 6 (the max I tried) all the CV runs failed due to memory allocation errors like this one:
    numpy.core._exceptions._ArrayMemoryError: Unable to allocate 39.0 GiB for an array with shape (47896, 109375) and data type float64. My data is only around 0.2 GB, so I don't know where this is coming from. The first dimension does correspond to the samples in the CV fold, but the second dimension doesn't make sense to me.

The second dimension is the number of words that are present in at least one time series. The number of possible words in the univariate case is n_bins ** word_size = 5 ** 6 = 15625. You need to multiply this number by the number of features in the multivariate case. If you have a lot of samples (which is your case), it's very likely that almost all the words will be present in at least one time series, so the number of features will be close to the number of possible words. Hence the huge number of features.

  • When I had gotten the best parameters from the grid search, I tried using the class you provided just to transform my data with these parameters, and again got a memory error when calling fit_transform on my training data: MemoryError: Unable to allocate 24.3 GiB for an array with shape (59871, 54432) and data type float64. I can do this transformation without issues with the methods I tried before, i.e. manually calling MultivariateTransformer and BagOfWords. What could be causing this? So now I can't fit this instance separately even though it ran within the grid search (I am using a bit more data now since it's not CV, but in total it's still only 0.2 GB).

It may depend on the data type of your array. If your background is more focused on mathematics than computer science (as is mine), you need to understand how your data is represented. Float64 is also known as Double-precision floating-point format and is the most precise (thus requiring the most memory) format for float numbers in a numpy array.

At the bottom of this page, you can find a nice summary of the different data types in numpy. For a given data type, you can find out the size (number of bytes) of a single element using the itemsize attribute of this array. Here is a quick summary:

np.array([0], dtype='int8').itemsize      # 1
np.array([0], dtype='uint8').itemsize     # 1
np.array([0], dtype='int16').itemsize     # 2
np.array([0], dtype='uint16').itemsize    # 2
np.array([0], dtype='float16').itemsize   # 2
np.array([0], dtype='int32').itemsize     # 4
np.array([0], dtype='uint32').itemsize    # 4
np.array([0], dtype='float32').itemsize   # 4
np.array([0], dtype='int64').itemsize     # 8
np.array([0], dtype='uint64').itemsize    # 8
np.array([0], dtype='float64').itemsize   # 8

Since a byte corresponds to 8 bits, you can see that the number at the end of the data type indicates the number of bits required to save a single element.

Back to your error, (59871 * 54432) / (1024 ** 3) * 8 = 24.28068 is the necessary memory (in GiB) to have a numpy.array with shape (59871, 54432) and dtype='float64'. When you perform grid search with cross-validation, you have fewer samples in each fold, which may explain why it still worked with GridSearchCV but doesn't work with the full dataset. However, the GridSearchCV instance fits a new model with the best hyper-parameters on the whole training set by default, so my explanation may be wrong.
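If you want to estimate the memory footprint without allocating the array, you can compute it directly from the shape and the item size (a small sketch):

import numpy as np

shape = (59871, 54432)
n_bytes_float64 = shape[0] * shape[1] * np.dtype('float64').itemsize
n_bytes_float32 = shape[0] * shape[1] * np.dtype('float32').itemsize

print(n_bytes_float64 / 1024 ** 3)  # ≈ 24.3 GiB
print(n_bytes_float32 / 1024 ** 3)  # ≈ 12.1 GiB: casting to float32 halves the memory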

That being said, there are several solutions to your issue, based on what was discussed above:

  • cast the TF-IDF matrix to a lower-precision data type such as float32, which halves the memory usage compared to float64;
  • keep the output of TfidfVectorizer as a sparse matrix (scipy.sparse.csr_matrix) instead of converting it to a dense array, and use a classifier that supports sparse input;
  • remove constant features with sklearn.feature_selection.VarianceThreshold;
  • decrease n_bins and/or word_size to reduce the number of possible words.

  • If I do the manual transformation with MultivariateTransformer and BagOfWords with the optimal parameters (so not using the class you provided), the results are worse than within the grid search. I suspect it is because of the absence of all the features that I generate even when they are not present (still setting them to 0, obviously), which may be throwing the classifier off? These results fall outside the cross-validation standard deviation of the performance scores.

Which code exactly are you using? Normally, I think that you should get the same results, but it requires a bit of work to reorder the columns (which should not have a big impact on the performance of your classifier, even for bagging models like random forest):

from pyts.datasets import load_basic_motions
from pyts.multivariate.transformation import MultivariateTransformer

# Create a function that appends the feature number as a suffix
def append_suffix(x, n_features):
    liste = []
    for j in range(n_features):
        liste.extend([word + f'_{j}' for word in x[j].split(' ')])
    return ' '.join(liste)

# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)


# Transformation using BagOfWordsTfidfTransformer
transformer = BagOfWordsTfidfTransformer(
    window_size=20, word_size=4, n_bins=4, strategy='normal',
    norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True
)
X_new_train = transformer.fit_transform(X_train, y_train)
X_new_test = transformer.transform(X_test)

# Hand-made transformation
bow = MultivariateTransformer(
    BagOfWords(window_size=20, word_size=4, n_bins=4, strategy='normal')
)
tfidf = TfidfVectorizer(norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True)

X_bow_train = bow.fit_transform(X_train)
X_bow_train = [append_suffix(X_bow_train[i], X_train.shape[1]) for i in range(X_train.shape[0])]
X_tfidf_train = tfidf.fit_transform(X_bow_train).A

# Reorder the columns to match the order of the columns of BagOfWordsTfidfTransformer
# First order on the feature (suffix), then on the word without the suffix
vocabulary = np.vectorize(
    {value: key for key, value in tfidf.vocabulary_.items()}.get
)(np.arange(X_tfidf_train.shape[1]))
ind = np.where(vocabulary[None, :] == transformer.vocabulary_[:, None])[1]
assert np.all(vocabulary[ind] == transformer.vocabulary_)
X_tfidf_train = X_tfidf_train[:, ind]

X_bow_test = bow.transform(X_test)
X_bow_test = [append_suffix(X_bow_test[i], X_test.shape[1]) for i in range(X_test.shape[0])]
X_tfidf_test = tfidf.transform(X_bow_test).A[:, ind]

# Check that the arrays are equal
assert np.all(X_tfidf_train == X_new_train)
assert np.all(X_tfidf_test == X_new_test)

Ahh, your input about dimensions and array sizes makes sense (indeed, my background is not computer science :) )

I also happen to be dealing with an imbalanced classification problem, so now that I have a baseline I wanted to use the pipeline from imbalanced-learn to run some experiments. Now I have a choice of when to upsample the minority class: before or after the BOW-TFIDF transformation. If I just do oversampling with replacement, I don't think it matters whether I resample before or after, because the same instance is going to be transformed in the same way.

If I arrange the pipeline like this, with the oversampling before the bowtfidf transformer, it doesn't run and I get this error:
ValueError: Found array with dim 3. Estimator expected <= 2., because imblearn expects (n_samples, n_features) as input, yet I have (n_samples, n_features, time_steps).

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline

sampler = RandomOverSampler(random_state=0, sampling_strategy=0.2)
pipe_imb = ImbPipeline(steps=[('oversample', sampler), ('bowtfidf', transformer), ('rfc', rfc)])

scoring = ['accuracy', 'recall', 'precision']
scores = cross_validate(pipe_imb, X_train_val, y_train_val, scoring=scoring, cv=5, verbose=5)

But for simple oversampling, as I said, I think it doesn't matter.
If I rearrange the pipeline like so:

pipe_imb = ImbPipeline(steps=[('bowtfidf', transformer), ('oversample', sampler), ('rfc', rfc)])

This works. But if I wanted to generate slightly different learning instances (something akin to SMOTE but for time series - I don't even know yet if it exists :D ) I should do it before the transformation. Do you have some experience with this?

Slightly off topic, are you a teacher/professor somewhere? Your explanations are very to-the-point!

If you just perform oversampling, it doesn't matter, as you said. Also, I don't think that you need oversampling if you use a scikit-learn classifier, because they all have a class_weight parameter (see the glossary) that allows you to give different weights to samples (depending on the class) in the loss function, which is almost identical to performing oversampling.
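For instance, with a random forest (a minimal sketch):

from sklearn.ensemble import RandomForestClassifier

# 'balanced' gives each sample a weight inversely proportional to the frequency of
# its class, which is roughly equivalent to random oversampling of the minority class
rfc = RandomForestClassifier(class_weight='balanced', random_state=42)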

But if I wanted to generate slightly different learning instances (something akin to SMOTE but for timeseries - I don't even know yet if it exists :D ) I should do it before the transformation. Do you have some experience with this?

There's no such tool in pyts and I didn't find anything like that in other Python packages for time series (sktime and tslearn). I think that SMOTE on raw time series is likely suboptimal since it treats features as if they were independent (SMOTE performs a convex combination of several samples to generate new ones, if I recall correctly). I think that, for time series (and more generally structured data such as images), it's probably better to find a space in which standard Euclidean stuff works well. I don't know if you are a bit familiar with artificial neural networks, but auto-encoders and generative adversarial networks basically consist in finding a good latent space to represent your data (and to generate new samples).

Back to time series, if you don't want to bother with complex black box deep learning approaches, I would say that you could investigate the discrete Fourier transform as the latent space:

  1. Transform your data set of time series into a data set of discrete Fourier coefficients
  2. Perform SMOTE on the data set of discrete Fourier coefficients to generate new Fourier coefficients
  3. Transform the new Fourier coefficients to time series (here is an example from the documentation with the code to perform the inverse transformation)

This should work for a dataset of univariate time series (but I don't know if it's good). For multivariate time series, it's probably trickier because of the correlation between the features. A simple solution would be to use the same convex combination for all the features. It may be possible to achieve this by applying SMOTE to each univariate time series dataset and setting the random_state to the same value (an integer for instance) for all the instances, but I'm not sure. I will have a look at it.

Edit: SMOTE computes the nearest neighbors of the minority samples, so you could get different neighbors for different features. Maybe an approach would be to stack the Fourier coefficients columns (as you do with the words), use SMOTE on this 2D dataset to generate the new Fourier coefficients for all the features at once, then go back to multivariate time series.

Slightly off topic, are you a teacher/professor somewhere? Your explanations are very to-the-point!

Thank you! I'm a postdoctoral researcher working on something totally different and my work on machine learning for time series is a side project (there's a bit more information on my website: https://johannfaouzi.github.io). I would like to become an associate professor at a university in the near future :)

Here is an example:

from imblearn.over_sampling import SMOTE
import numpy as np
from pyts.datasets import load_basic_motions
from pyts.multivariate.transformation import MultivariateTransformer
from pyts.approximation import DiscreteFourierTransform

# Disable warnings
import warnings
warnings.filterwarnings('ignore')


# Toy dataset
X, _, y, _ = load_basic_motions(return_X_y=True)
n_samples, n_features, n_timestamps = X.shape

# Perform discrete Fourier transform
dft = MultivariateTransformer(DiscreteFourierTransform())
X_dft = dft.fit_transform(X).reshape(n_samples, -1)

# Perform over-sampling with SMOTE
smote = SMOTE(sampling_strategy={'Standing': 42, 'Running': 79, 'Walking': 69, 'Badminton': 420},
              random_state=42)
X_dft_resampled, y_resampled = smote.fit_resample(X_dft, y)
n_samples_resamples = X_dft_resampled.shape[0]

# Perform inverse discrete Fourier transform
X_dft_resampled = X_dft_resampled.reshape(n_samples_resamples, n_features, n_timestamps)

if n_timestamps % 2 == 0:
    real_idx = np.arange(1, n_timestamps, 2)
    imag_idx = np.arange(2, n_timestamps, 2)
    X_new = []
    for i in range(n_features):
        X_new.append(np.c_[
            X_dft_resampled[:, i, :1],
            X_dft_resampled[:, i, real_idx] + (
                1j * np.c_[X_dft_resampled[:, i, imag_idx], np.zeros((n_samples_resamples, ))]
            )
        ])
else:
    real_idx = np.arange(1, n_timestamps, 2)
    imag_idx = np.arange(2, n_timestamps + 1, 2)
    X_new = []
    for i in range(n_features):
        X_new.append(np.c_[
            X_dft_resampled[:, i, :1],
            X_dft_resampled[:, i, real_idx] + 1j * X_dft_resampled[:, i, imag_idx]
        ])

X_dft_resampled = np.transpose(X_new, (1, 0, 2))
X_resampled = np.fft.irfft(X_dft_resampled, n_timestamps)

Then you can plot the training time series and some generated time series:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))

# Define the indices for the training time series for the 'Walking' class
training_idx = np.arange(20, 30)

# Pick an index for a generated time series for the 'Walking' class
generated_idx = -5

for i in range(n_features):
    # Plot the training time series
    plt.subplot(6, 2, 2 * i + 1)
    for x in training_idx:
        plt.plot(X[x, i], color=f'C{i}')
        ylim = plt.ylim()
    
    # Plot a generated time series
    plt.subplot(6, 2, 2 * i + 2)
    plt.plot(X_resampled[generated_idx, i], color=f'C{i}')
    plt.ylim(ylim)

which gives you the following figure:
Capture d’écran 2021-09-07 à 12 06 12

I can't really tell if the generated time series make much sense, it may require some domain knowledge.

Just a few obvious comments on resampling: it should only be performed on the training set, which means that:

  • you must perform your train/test split before applying resampling (also the case for the training/validation split with grid search); the imbalanced-learn / scikit-learn tools for pipelines and grid search come in handy (see the sketch after this list).
  • you must not resample the test set: you don't want to evaluate the performance of your classifier on generated samples and you want the class distribution in your test set to be the same as the expected class distribution in your real-life applications, so if your use case leads to imbalanced classes, you also want imbalanced classes in your test set.
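For instance, something along these lines should work (a sketch reusing the BagOfWordsTfidfTransformer class and your pipeline layout; X and y stand for your raw data and labels). Note that the samplers in an imblearn pipeline are only applied during fit, so the test set and the validation folds stay untouched:

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split first, then let the pipeline handle resampling on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = ImbPipeline(steps=[
    ('bowtfidf', BagOfWordsTfidfTransformer(word_size=4)),
    ('oversample', RandomOverSampler(random_state=0)),
    ('rfc', RandomForestClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)    # the sampler is only applied here
pipe.score(X_test, y_test)    # the test set is left untouched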

I took a small break, and I am now in the final leg of this project. :)

I wanted to ask you about the symbol selection for constant subsequences. Is it random in pyts? I'm continuing with the transformer class you kindly provided and I get some interesting results with decision trees. When I plot one such tree, I see that the top-level node and the first node of the following level both belong to the same signal (as indicated by the subscript) and both correspond to constant subsequences. What surprises me is that they both are important for classification, since they're at the top of the tree. I thought (but maybe I misunderstood our initial exchange) that aa and bb for one signal are essentially the same thing, but if so, how could they be so meaningful for classification?
image

EDIT. And if there is no meaningful difference between aa and bb, can I somehow force it to only use one of them?

I'm going to assume that you are using the quantiles of the standard normal distribution to define the bin edges (i.e., strategy='normal') since it's the default value and it's what is used in the original paper, but the same reasoning can be applied to the other strategies used to compute the bin edges.

Based on what you pointed out in saxpy, there is now a threshold_std parameter that controls whether subsequences are standardized (zero mean, unit variance) or not: subsequences whose standard deviation is below the threshold are not standardized.

If this option is enabled, then different constant subsequences may lead to different words in order to take into account the constant value. For instance, with 3 symbols, the bin edges are:

Interval                  Symbol
(-∞, -0.430727]           'a'
(-0.430727, 0.430727]     'b'
(0.430727, +∞)            'c'

So (-1., -1., -1.) would yield 'aaa', (0., 0., -0.) would yield 'bbb' and (1., 1., 1.) would yield 'ccc'.

If this option is disabled, then all the subsequences are standardized. Constant subsequences will be transformed into a vector full of zeros (dividing by the standard deviation is skipped). In this case, all the constant subsequences will be mapped to the same word (with the same symbol). For instance, with the previous examples, the three subsequences would be transformed into 'bbb'.

If you are using the default value (threshold_std = 0.01), then the option is enabled, and thus different constant subsequences may yield different words (depending on the constant values). The previous version of the code (the one when you opened this issue) did not have this option (and it was disabled).
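For reference, here is a small sketch showing how the bin edges above can be computed (quantiles of the standard normal distribution via scipy) and how a non-standardized constant subsequence gets its symbol:

import numpy as np
from scipy.stats import norm

n_bins = 3
bin_edges = norm.ppf(np.linspace(0, 1, n_bins + 1)[1:-1])
print(bin_edges)  # approximately [-0.4307273  0.4307273]

# With standardization skipped, the constant value determines the symbol
for x in ([-1., -1., -1.], [0., 0., 0.], [1., 1., 1.]):
    symbols = np.digitize(x, bin_edges)  # 0 -> 'a', 1 -> 'b', 2 -> 'c'
    print(''.join('abc'[s] for s in symbols))
# aaa, bbb, ccc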

Let me know if this is clearer or not!

Right indeed you are! This makes sense, I remember now :)

Lastly, I wanted to visualise the relevant patterns I find with pyts on my original time series. I've seen the example in the documentation, which shows the transformation of the whole time series, and it's close to what I'm looking for. But instead of the whole transformation I just want the relevant windows/subsequences (which I get from my classifier). So if, in a particular instance (time series), I wish to visualize just the occurrences of the pattern abc, what is the best way to do it?

In saxpy when the patterns are found it also returns the indices where they are found in the array, so it makes it easy to retrieve them. What approach could you recommend with pyts ?

I actually managed to use your example code to achieve what I want; I had misunderstood the example. If I provide the feature of interest as input to your code and change the step size to the one I use, it works (also changing numerosity reduction to True). Then I get this, which corresponds to the patterns found in that signal: db ea db bb

image_2021-09-28_132639

import matplotlib.pyplot as plt
import numpy as np
from pyts.bag_of_words import BagOfWords

# Load the dataset and perform the transformation

X = x_new[:, 0, :]  # Here I'm selecting only the feature I'm interested in
window_size, word_size = 12, 2
bow = BagOfWords(window_size=window_size, word_size=word_size, n_bins=5,
                 window_step=1, numerosity_reduction=True)
X_bow = bow.transform(X)

# Plot the considered subseries
plt.figure(figsize=(10, 4))
splits_series = np.linspace(0, X.shape[1], 1 + X.shape[1] // window_size,
                            dtype='int64')
for start, end in zip(splits_series[:-1],
                      np.clip(splits_series[1:] + 1, 0, X.shape[1])):
    plt.plot(np.arange(start, end), X[0, start:end], 'o-', lw=1, ms=1)

# Plot the corresponding letters
splits_letters = np.linspace(0, X.shape[1],
                             1 + word_size * X.shape[1] // window_size)
splits_letters = (
    (splits_letters[:-1] + splits_letters[1:]) / 2).astype('int64')
for i, (x, text) in enumerate(zip(splits_letters, X_bow[0].replace(' ', ''))):
    plt.text(x, X[0, x] + 0.1, text, color="C{}".format(i // 5), fontsize=14)

# plt.ylim((-1, 2.2))
plt.xlabel('Time', fontsize=12)
plt.title('Bag-of-words representation for time series', fontsize=16)
plt.tight_layout()
plt.show()

UPDATE. I did notice that for an increased word size I get more patterns than in the photo, any idea why? With these settings, BagOfWords(word_size=4, n_bins=5, window_size=12, window_step=1, numerosity_reduction=True), for a particular instance of length 50 I get these words: edba edca edba edaa ecbb ebbb bbbb. But when I plot, I don't see the last two (ebbb and bbbb). I see that in the splits_series bit we divide by the window_size to get the appropriate number of words for that instance, but how does that work when the windows overlap? Looking at my images again, the bbbb should probably be at the end but isn't shown due to the overlap. So this maybe goes back to my previous point of whether it's possible to retrieve the indices of the found words.
image_2021-09-28_140856

The location of the symbols is wrong. I think that you may have misunderstood how the subsequences are extracted. Nonetheless, updating the code to get what you're looking for was a great idea!

If you look at your time series, you can see that it is (almost) constant starting at time point 11. Therefore, all the extracted subsequences whose start index is greater than or equal to 11 will be transformed into the bbbb word (and a single bbbb word is output with numerosity reduction). To get the indices of the words, it is better (only for visualization purposes) to disable numerosity reduction.
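If you want the start indices of the words, one option is to disable numerosity reduction and derive the indices from the window step (a sketch, using the same trick as in the code below; X stands for the array with your feature of interest):

import numpy as np
from pyts.bag_of_words import BagOfWords

window_size, word_size, window_step = 12, 4, 1

bow = BagOfWords(window_size=window_size, word_size=word_size, n_bins=5,
                 window_step=window_step, numerosity_reduction=False)
words = np.array(bow.transform(X)[0].split(' '))

# Start index of the subsequence corresponding to each word
start_indices = np.arange(len(words)) * window_step

# Words that would be kept by numerosity reduction (first occurrence of each run)
kept = np.r_[True, words[1:] != words[:-1]]
print(list(zip(words[kept], start_indices[kept])))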

I recently submitted a review paper on time series classification and coded a function to illustrate another algorithm (pyts.transformation.BOSS) that is very similar to what you're doing (except that the words are extracted using the pyts.approximation.SymbolicFourierApproximation instead of PAA+SAX). The resulting figure is displayed below:

boss

with the following caption:

  • (a) just displays the raw time series.
  • (b) highlights all the extracted subsequences.
  • (c) shows the corresponding word for each subsequence (words with high transparency are discarded with numerosity reduction).
  • (d) shows the final transformation (i.e., the histogram with the word frequencies).

Here is an example with BagOfWords for a single feature (some display parameters may require some changes depending on the values of the raw time series):

import matplotlib.pyplot as plt
import numpy as np
from pyts.bag_of_words import BagOfWords
from pyts.datasets import load_basic_motions
from sklearn.feature_extraction.text import CountVectorizer


# Display stuff
x_coord_caption_subfigure = 0.00
y_coord_caption_subfigure = 1.04

# Data
X, _, _, _ = load_basic_motions(return_X_y=True)
X = X[:, 0, :]

# Parameters
window_size = 10
window_step = 1
word_size = 2
n_bins = 5
n_windows = (X.shape[1] - window_size + window_step) // window_step

fig, axes = plt.subplots(4, 1, figsize=(12, 8), gridspec_kw={'hspace': 0.5})

# Plot raw time series
idx = 0
axes[0].plot(X[idx])
xlim = axes[0].get_xlim()
ylim = axes[0].get_ylim()

axes[0].text(x_coord_caption_subfigure, y_coord_caption_subfigure,
         '(a)', fontweight='bold', va='bottom', ha='left', transform=axes[0].transAxes)

# Plot subsequences
def offset(x_sub, i):
    return - (i % (window_size + 5)) / 4 + (1.0052278 - x_sub.max())

for i in range(n_windows):
    x_sub = X[idx, i:i + window_size]
    axes[1].plot(np.arange(i, i + window_size), x_sub + offset(x_sub, i), linewidth=0.5)
axes[1].set_ylim((ylim[0] - 2, ylim[1]))
axes[1].set_axis_off()

axes[1].text(x_coord_caption_subfigure, 0.95,
         '(b)', fontweight='bold', va='bottom', ha='left', transform=axes[1].transAxes)

# Plot the SAX words (words discarded by numerosity reduction are shown with high transparency)
bow = BagOfWords(window_size=window_size, word_size=word_size, n_bins=n_bins,
                 window_step=window_step, numerosity_reduction=False)
X_bow = bow.transform(X)
X_word = np.asarray([X_bow[i].split(' ') for i in range(X.shape[0])])
not_equal_idx = np.c_[np.full(X.shape[0], True), X_word[:, 1:] != X_word[:, :-1]]

x_word = X_word[idx]
x_bow = X_bow[idx]

for i in range(n_windows):
    alpha = 1. if not_equal_idx[idx, i] else 0.2
    fontweight = 'bold' if not_equal_idx[idx, i] else 'normal'
    axes[2].text(i, 1.2 - (i % (window_size + 5)) / 10, x_word[i],
                 color=f'C{i}', alpha=alpha, fontsize=6, fontweight=fontweight)

axes[2].set_xlim(xlim)
axes[2].set_axis_off()

axes[2].text(x_coord_caption_subfigure, 1.26,
             '(c)', fontweight='bold', va='bottom', ha='left', transform=axes[2].transAxes)

# Plot histogram
bow.set_params(numerosity_reduction=True)
X_bow_with_numerosity = bow.transform(X)

count = CountVectorizer()
X_hist = count.fit_transform(X_bow_with_numerosity).A

vocabulary = np.array(sorted(count.vocabulary_.keys()))
axes[3].bar(np.arange(np.sum(X_hist[idx] > 0)), X_hist[idx][X_hist[idx] > 0])
axes[3].set_xticks(np.arange(np.sum(X_hist[idx] > 0)))
axes[3].set_xticklabels(vocabulary[X_hist[idx] > 0])

axes[3].text(x_coord_caption_subfigure, y_coord_caption_subfigure,
         '(d)', fontweight='bold', va='bottom', ha='left', transform=axes[3].transAxes)

plt.show()

which gives the following figure:

temp