BOSSVS not working with a single feature
roman-4erkasov opened this issue
Description
The class pyts.classification.BOSSVS doesn't accept time series with a single feature.
The advice in the error message doesn't help; following it only leads to another error.
Steps/Code to Reproduce
I tried all three possible ways of passing time series with a single feature:
- The first version is just 1D-array:
import numpy as np
from pyts.classification import BOSSVS
x_train = np.random.uniform(low=0.0, high=10.0, size=(300,))
y_train = np.random.randint(low=0, high=2, size=(300,))
est = BOSSVS().fit(x_train, y_train)
This example gives:
ValueError: Expected 2D array, got 1D array instead:
array=[1.84916215 7.16606073 5.69089018 ... ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
- After that, I tried the following code:
import numpy as np
from pyts.classification import BOSSVS
x_train = np.random.uniform(low=0.0, high=10.0, size=(300,))
y_train = np.random.randint(low=0, high=2, size=(300,))
est = BOSSVS().fit(x_train.reshape(-1, 1), y_train)
It gives the following error:
ValueError: If 'window_size' is an integer, it must be greater than or equal to 1 and lower than or equal to n_timestamps if 'drop_sum=False'.
- Finally I tried the following code:
import numpy as np
from pyts.classification import BOSSVS
x_train = np.random.uniform(low=0.0, high=10.0, size=(300,))
y_train = np.random.randint(low=0, high=2, size=(300,))
est = BOSSVS().fit(x_train.reshape(1, -1), y_train)
It gives the following error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 300]
Versions
NumPy 1.23.4
SciPy 1.9.3
Scikit-Learn 1.2.0
Numba 0.56.4
Pyts 0.12.0
Thank You!
Sorry for the delayed response.
In order to train a classification algorithm, one needs several samples. In our case, a sample is a time series. This set of training samples is often called the training set.
The expected format for the training set is similar to the one used in scikit-learn (if you are familiar with it):
- X_train is a 2D array with shape (n_samples, n_timestamps): the first dimension corresponds to the samples (time series), while the second dimension corresponds to time.
- y_train is a 1D array with shape (n_samples,): it contains the label associated with each sample (time series).
The format is identical for the test set.
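As a minimal sketch of this layout, here is a synthetic training set built with NumPy only. The sizes (40 series of 100 values) and the random values are assumptions made up for illustration; they do not come from the issue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumption): 40 time series, 100 time points each.
n_samples, n_timestamps = 40, 100

# One row per time series, one column per time point.
X_train = rng.uniform(low=0.0, high=10.0, size=(n_samples, n_timestamps))
# One label per time series.
y_train = rng.integers(low=0, high=2, size=n_samples)

# The two arrays must agree on the number of samples.
assert X_train.ndim == 2 and y_train.ndim == 1
assert X_train.shape[0] == y_train.shape[0]

print(X_train.shape, y_train.shape)
```

Arrays shaped like this match the format described above. Random values will of course not yield a meaningful classifier.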
Let's load a toy dataset to illustrate this:
>>> from pyts.datasets import load_gunpoint
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> X_train.shape
(50, 150) # there are 50 time series, each with 150 values.
>>> y_train.shape
(50,) # there are 50 labels because there are 50 time series in the training set.
>>> y_train
array([2, 2, 1, 1, 2, ...]) # there are 2 labels (denoted as the integers 1 and 2).
>>> X_test.shape
(150, 150) # there are 150 time series, each with 150 values.
>>> y_test.shape
(150,) # there are 150 labels because there are 150 time series in the test set.
Now, one can perform classification using BOSSVS on this dataset:
>>> from pyts.classification import BOSSVS
>>> clf = BOSSVS()
>>> clf.fit(X_train, y_train)
BOSSVS()
>>> clf.score(X_test, y_test)
0.82 # accuracy score of 0.82 on the test set
Back to your example: I don't understand your data. It seems that you have 300 time series, each with a single value. BOSSVS cannot be used with such data, and no time series analysis is possible when each time series has a single value. It probably does not make sense to consider this kind of data as time series.
I hope this helps a bit. I would be happy to give you more information if needed, but I'm not sure I understand your data.
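To make the shapes concrete, here is a small NumPy-only sketch of what the two reshapes from the report produce. It reuses the array sizes from the issue; the variable names as_column and as_row are only illustrative.

```python
import numpy as np

# Same sizes as in the report: 300 values and 300 labels.
x = np.random.uniform(low=0.0, high=10.0, size=(300,))
y = np.random.randint(low=0, high=2, size=(300,))

# reshape(-1, 1): 300 samples, each a "time series" with one timestamp.
as_column = x.reshape(-1, 1)
# reshape(1, -1): a single sample with 300 timestamps.
as_row = x.reshape(1, -1)

print(as_column.shape)  # (300, 1)
print(as_row.shape)     # (1, 300)

# With one sample and 300 labels, the sample counts do not match.
print(as_row.shape[0] == y.shape[0])  # False
```

In the first layout, n_timestamps is 1, which is consistent with the error saying that 'window_size' must be lower than or equal to n_timestamps. In the second layout, there is one time series but 300 labels, which is consistent with the "inconsistent numbers of samples: [1, 300]" error.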