samples with different length
thunderbug1 opened this issue · comments
If I understand the WEASEL+MUSE algorithm correctly, it should be possible to use it with samples of different lengths.
This is currently not possible with the API of the WEASELMUSE class, which expects a 3D array of shape (n_samples, n_features, n_timestamps): a NumPy array forces every sample to have the same length.
I tried padding the time series of all samples with NaN values up to the length of the longest sample, but the input checks reject NaN values.
Is there a way to use samples of different lengths?
Hi,
Sorry for the late reply. Variable-length data sets are unfortunately not supported at the moment.
Regarding WEASEL+MUSE, you can achieve this with the following process:
- Create a data set for each unique length value (in each data set, the time series should have the same length)
- Transform each data set using a separate instance of WEASELMUSE (set chi2_threshold to a very low positive value in order to not perform feature selection)
- Concatenate the transformed data sets (the pandas package is handy for this)
- Perform feature selection on the concatenated data set
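The first step of the process above (one data set per unique length) can be sketched as follows. This is only a minimal illustration using NumPy: the helper name group_by_length and the toy data are assumptions made for the example and are not part of pyts.

```python
from collections import defaultdict

import numpy as np


def group_by_length(series_list):
    """Group multivariate series of shape (n_features, n_timestamps) by length.

    Returns a dict mapping each length to (indices, X), where X is a 3D array
    of shape (n_samples_with_that_length, n_features, length).
    Hypothetical helper, not a pyts function.
    """
    buckets = defaultdict(list)
    for idx, series in enumerate(series_list):
        buckets[series.shape[1]].append(idx)
    return {
        length: (np.array(indices), np.stack([series_list[i] for i in indices]))
        for length, indices in buckets.items()
    }


# Toy data: 6 samples with 3 features each and lengths in {80, 90, 100}
rng = np.random.RandomState(0)
toy_lengths = [80, 90, 100, 80, 100, 90]
samples = [rng.randn(3, n) for n in toy_lengths]

groups = group_by_length(samples)
for length, (indices, X) in sorted(groups.items()):
    print(length, indices, X.shape)
```

Keeping the original sample indices next to each array makes it possible to realign the transformed rows with the labels after the per-length outputs are concatenated.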
The main downside of this approach is the high memory (RAM) usage, because feature selection is only performed at the last step. A possible solution (that would lead to the same results) would be to loop over the window sizes: instead of passing a list of k window sizes to the window_sizes parameter, you write a for loop over the window sizes and pass a single window size to WEASELMUSE inside the loop, so that feature selection can be applied after each iteration rather than once at the end.
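The reason such a loop would give the same results is that the chi2 statistic is computed for each feature independently, so selecting features block by block and then concatenating keeps the same columns as selecting on the full concatenation. The sketch below checks this on random count data; the two blocks are hypothetical stand-ins for the bag-of-words features of two window sizes (real blocks would come from WEASELMUSE), and the select helper is an assumption made for the example.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.RandomState(0)
n_samples = 40
y = rng.randint(0, 4, size=n_samples)

# Stand-ins for the word counts obtained with two different window sizes
X_block_a = rng.poisson(2.0, size=(n_samples, 5))
X_block_b = rng.poisson(2.0, size=(n_samples, 7))
# Make one feature per block depend on the label so that it is clearly kept
X_block_a[:, 0] += 3 * y
X_block_b[:, 2] += 3 * y

threshold = 2.0


def select(X, y, threshold):
    """Keep the columns whose chi2 statistic exceeds the threshold."""
    scores, _ = chi2(X, y)
    return X[:, scores > threshold]


# Selection once at the end, on all features (higher peak memory)
X_all_at_once = select(np.hstack([X_block_a, X_block_b]), y, threshold)

# Selection inside the loop, one block at a time (lower peak memory)
X_block_by_block = np.hstack(
    [select(X_block, y, threshold) for X_block in (X_block_a, X_block_b)]
)

print(X_all_at_once.shape, X_block_by_block.shape)
print(np.array_equal(X_all_at_once, X_block_by_block))
```

Only the selected columns of each block need to stay in memory, which is where the saving would come from when many window sizes are used.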
Here is an example (without the aforementioned optimization; I can modify the example to show it if needed):
import numpy as np
import matplotlib.pyplot as plt
from pyts.datasets import load_basic_motions
from pyts.multivariate.transformation import WEASELMUSE
import pandas as pd
from sklearn.feature_selection import chi2
#######################
####### D A T A #######
#######################
# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)
# X_train.shape = X_test.shape = (40, 6, 100)
# Sample 4 distinct random lengths in the interval [80, 100]
rng = np.random.RandomState(42)
lengths = 80 + rng.choice(21, size=4, replace=False)
# Assign 10 time series to each length
lengths_samples_train_idx = rng.permutation(40).reshape((4, 10))
lengths_samples_test_idx = rng.permutation(40).reshape((4, 10))
#######################
# P A R A M E T E R S #
#######################
# WEASEL+MUSE parameters
weasel_muse_params = {'word_size': 2, 'n_bins': 2, 'window_sizes': [12, 36],
                      'chi2_threshold': 1e-80}
transformer_list = [WEASELMUSE(**weasel_muse_params) for _ in range(4)]
#######################
### T R A I N I N G ###
#######################
X_weasel_train = []
for samples_idx, length, transformer in zip(lengths_samples_train_idx, lengths, transformer_list):
    X_weasel_train.append(transformer.fit_transform(X_train[samples_idx, :, :length], y_train[samples_idx]))
# Concatenate the arrays as a DataFrame (columns are the words) and fill NA values with 0
df_weasel_train = pd.concat([
    pd.DataFrame.sparse.from_spmatrix(
        X, index=samples_idx, columns=np.vectorize(transformer.vocabulary_.get)(np.arange(X.shape[1]))
    )
    for X, samples_idx, transformer in zip(X_weasel_train, lengths_samples_train_idx, transformer_list)
]).fillna(0.)
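# Perform feature selection using chi2 test
# The rows of df_weasel_train follow the shuffled sample indices,
# so the labels are reordered to match the DataFrame index
chi2_threshold = 2.
chi2_statistics, _ = chi2(df_weasel_train, y_train[df_weasel_train.index.to_numpy()])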
features_idx_to_keep = np.where(chi2_statistics > chi2_threshold)[0]
features_to_keep = df_weasel_train.columns[features_idx_to_keep]
df_weasel_train = df_weasel_train[features_to_keep]
#######################
## I N F E R E N C E ##
#######################
X_weasel_test = []
for samples_idx, length, transformer in zip(lengths_samples_test_idx, lengths, transformer_list):
    X_weasel_test.append(transformer.transform(X_test[samples_idx, :, :length]))
# Concatenate the arrays as a DataFrame, fill NA values with 0 and keep the selected features
df_weasel_test = pd.concat([
    pd.DataFrame.sparse.from_spmatrix(
        X, index=samples_idx, columns=np.vectorize(transformer.vocabulary_.get)(np.arange(X.shape[1]))
    )
    for X, samples_idx, transformer in zip(X_weasel_test, lengths_samples_test_idx, transformer_list)
]).fillna(0.)[features_to_keep]
Let me know if this helps you.
Oh wow, thanks for the extensive example.
I wouldn't have considered using separate instances of WEASELMUSE, but it makes sense.
I will give it a try.