BOSSVS not working with a single feature

roman-4erkasov opened this issue a year ago

Description

The class pyts.classification.BOSSVS doesn't accept time series with a single feature.
Following the advice in the error message doesn't help; it only leads to another error.

Steps/Code to Reproduce

I tried all three possible ways of passing a time series with a single feature.

The first version is just a 1D array:

import numpy as np
from pyts.classification import BOSSVS
x_train = np.random.uniform(low=0.0, high=10.0, size=(300,))
y_train = np.random.randint(low=0, high=2, size=(300,))
est = BOSSVS().fit(x_train, y_train)

This example gives:

ValueError: Expected 2D array, got 1D array instead:
array=[1.84916215 7.16606073 5.69089018 ... ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

After that I tried the following code:

import numpy as np
from pyts.classification import BOSSVS
x_train = np.random.uniform(low=0.0, high=10.0, size=(300,))
y_train = np.random.randint(low=0, high=2, size=(300,))
est = BOSSVS().fit(x_train.reshape(-1, 1), y_train)

It gives the following error:
ValueError: If 'window_size' is an integer, it must be greater than or equal to 1 and lower than or equal to n_timestamps if 'drop_sum=False'.

Finally I tried the following code:

import numpy as np
from pyts.classification import BOSSVS
x_train = np.random.uniform(low=0.0, high=10.0, size=(300,))
y_train = np.random.randint(low=0, high=2, size=(300,))
est = BOSSVS().fit(x_train.reshape(1, -1), y_train)

It gives the following error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 300]

Versions

NumPy 1.23.4
SciPy 1.9.3
Scikit-Learn 1.2.0
Numba 0.56.4
Pyts 0.12.0

Thank You!

Johann Faouzi · Answer 1 · Mon Mar 06 2023 23:54:10 GMT+0800 (China Standard Time)

Sorry for the delayed response.

In order to train a classification algorithm, one needs several samples. In our case, a sample is a time series. This set of training samples is often called the training set.

The expected format for the training set is similar to the one used in scikit-learn (if you are familiar with it):

X_train is a 2D-array with shape (n_samples, n_timestamps): the first dimension corresponds to the samples (time series), while the second dimension corresponds to the time.
y_train is a 1D-array with shape (n_samples,): it contains the label associated with each sample (time series).

The format is identical for the test set.
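As a minimal sketch of this format with synthetic data (the sizes 300 and 50 are arbitrary values picked for illustration, not something taken from your problem), a valid training set could be built like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes chosen for illustration only.
n_samples, n_timestamps = 300, 50

# One row per time series, one column per time point.
X_train = rng.uniform(low=0.0, high=10.0, size=(n_samples, n_timestamps))

# One label per time series (per row), not per time point.
y_train = rng.integers(low=0, high=2, size=n_samples)

print(X_train.shape)  # (300, 50)
print(y_train.shape)  # (300,)
```

The key point is that the first dimension of X_train and the length of y_train match, and that each row holds more than one value.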

Let's load a toy dataset to illustrate this:

>>> from pyts.datasets import load_gunpoint
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> X_train.shape
(50, 150)  # there are 50 time series, each with 150 values.
>>> y_train.shape
(50,)  # there are 50 labels because there are 50 time series in the training set.
>>> y_train
array([2, 2, 1, 1, 2, ...])  # there are 2 labels (denoted as the integers 1 and 2).
>>> X_test.shape
(150, 150)  # there are 150 time series, each with 150 values.
>>> y_test.shape
(150,)  # there are 150 labels because there are 150 time series in the test set.

Now, one can perform classification using BOSSVS on this dataset:

>>> from pyts.classification import BOSSVS
>>> clf = BOSSVS()
>>> clf.fit(X_train, y_train)
BOSSVS()
>>> clf.score(X_test, y_test)
0.82  # accuracy score of 0.82 on the test set

Back to your example, I don't understand your data. It seems that you have 300 time series, but each time series has a single value. You cannot use BOSSVS with such data. You cannot do any time series analysis if the time series have a single value. It probably does not make sense to consider this kind of data as time series.
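To make this concrete, here is a rough sketch of how each of your three attempts is read under the (n_samples, n_timestamps) convention. The describe_input function is a hypothetical helper written for this reply, not part of pyts, and it only mimics the checks in plain NumPy:

```python
import numpy as np


def describe_input(X, y):
    """Hypothetical helper: describe X under the (n_samples, n_timestamps) convention."""
    X = np.asarray(X)
    y = np.asarray(y)
    if X.ndim != 2:
        return f"{X.ndim}D input: a 2D array (n_samples, n_timestamps) is expected"
    n_samples, n_timestamps = X.shape
    if n_samples != y.shape[0]:
        return f"{n_samples} time series but {y.shape[0]} labels"
    if n_timestamps < 2:
        return f"{n_samples} time series with {n_timestamps} value each: too short"
    return f"{n_samples} time series with {n_timestamps} values each"


x = np.random.uniform(low=0.0, high=10.0, size=(300,))
y = np.random.randint(low=0, high=2, size=(300,))

print(describe_input(x, y))                 # first attempt: 1D array
print(describe_input(x.reshape(-1, 1), y))  # second attempt: shape (300, 1)
print(describe_input(x.reshape(1, -1), y))  # third attempt: shape (1, 300)
```

The second attempt is the one that triggers the 'window_size' error: with n_timestamps equal to 1, no sliding window larger than a single point can fit inside the series. The third attempt fails earlier because there is one time series but 300 labels.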

Hope this helps a bit. I would be happy to give you more information if needed, but I'm not sure I understand your data.