johannfaouzi / pyts

A Python package for time series classification

Home page: https://pyts.readthedocs.io

Question on the Interpretability of TSBF

aledcuevas opened this issue

Hello, I have a question about the interpretability of the TSBF model. Broadly, I want to understand what specific subsequences or intervals most contribute to the predictive power.

When accessing the TSBF estimator, we are able to inspect the features and feature importances of the second random forest classifier (RFC). There are (n_bins + 1) * n_classes of these features, and their order seems to be mapped as follows.

Let k be the number of bins and j the number of classes. The order appears to be:
bins 1 to k for class j0, mean probability for class j0, bins 1 to k for class j1, mean probability for class j1, etc.

Each of these bins is derived from features computed on the subsequences and intervals of each time-series sample. I'm wondering whether there's a way to understand which features from each subsequence are most important. For instance, are there specific time periods within a subsequence that are most useful for prediction? As of now, I'm also lost on how to interpret each of the bins: what does it mean for a bin_i_k to have a high feature importance?

What I've tried/inferred so far

Given a fitted TSBF, we can access the interval indices. In my case, the interval indices have shape 6x4, where 6 is the number of subsequences/subseries and 4 is the number of intervals. These interval indices are then used to compute (start, end) pairs (a total of 18 pairs for an array of shape 6x4), which in turn are used to compute statistics for the subsequences (4 stats) and intervals (3 stats). These are returned as X_features, and the transformation should yield X_new : array, shape = (n_samples * n_subseries, 3 * n_intervals + 4). Since I'm working with 450 samples, I should get X_new.shape = (450 * 6, 3 * 4 + 4) = (2700, 16). This X_new is used to train a random forest classifier.

We can access the estimators within the TSBF, which form an ensemble of trees (i.e., a random forest). Each of these trees has (n_bins + 1) * n_classes features. What I'm trying to understand is: from the X_new that was extracted, which subsequences are useful? What does each of these bins map to?

Hi,

Your understanding of TSBF looks right to me.

Indeed, TSBF is hard to interpret because it consists of several steps:

  1. One starts with a dataset X of n_samples time series, each of length n_timestamps: X.shape = (n_samples, n_timestamps)
  2. TSBF first extracts n_subsequences subsequences from each time series, and each subsequence is also split into n_intervals subintervals. From each subinterval, 3 features are extracted (mean, standard deviation, slope). From each subsequence, 4 features are extracted (mean, standard deviation, start index, end index). As you said, we obtain a new dataset with shape (n_samples * n_subsequences, 3 * n_intervals + 4). Each row is a subsequence and each column is a feature.
  3. TSBF then fits a first random forest classifier on this dataset. Each row corresponds to a subseries/subsequence, and its label is the label of the original time series.
  4. TSBF computes the out-of-bag probabilities for each subsequence of belonging to each class. The new dataset is of shape (n_samples * n_subsequences, n_classes). Each row is a subsequence and each column is the out-of-bag probability of belonging to a class.
  5. The next idea is to "reduce" the probabilities over all the subsequences from a single time series. Instead of keeping only a single number (the mean probability, for instance), the histogram of the probabilities is computed, and the mean probability is kept as well. This leads to the final dataset with shape (n_samples, (n_bins + 1) * n_classes).
  6. A random forest classifier is finally fitted on this dataset.
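
The shapes flowing through these steps can be sketched with plain NumPy. This is a stand-in sketch, not pyts' actual implementation: the data is random, and the sizes (450 samples, 6 subsequences, 4 intervals, 2 classes, 10 bins) are taken from the example in the question; the class-major, bins-then-mean column order is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes taken from the question; the data itself is a random stand-in.
n_samples, n_subsequences, n_intervals = 450, 6, 4
n_classes, n_bins = 2, 10

# Step 2: one row per subsequence, 3 stats per interval + 4 subsequence stats.
X_features = rng.normal(size=(n_samples * n_subsequences, 3 * n_intervals + 4))
print(X_features.shape)  # (2700, 16)

# Step 4: out-of-bag class probabilities for every subsequence.
oob_proba = rng.dirichlet(np.ones(n_classes), size=n_samples * n_subsequences)
oob_proba = oob_proba.reshape(n_samples, n_subsequences, n_classes)

# Step 5: per time series and per class, histogram of the subsequence
# probabilities plus their mean (assumed class-major, bins-then-mean order).
bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
blocks = []
for k in range(n_classes):
    proba_k = oob_proba[:, :, k]  # shape (n_samples, n_subsequences)
    counts = np.stack([np.histogram(p, bins=bin_edges)[0] for p in proba_k])
    blocks.append(np.hstack([counts, proba_k.mean(axis=1, keepdims=True)]))
X_final = np.hstack(blocks)
print(X_final.shape)  # (450, 22) == (n_samples, (n_bins + 1) * n_classes)
```

Note that each per-class histogram's counts sum to n_subsequences, since every subsequence falls into exactly one bin.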

So, if we want to perform "reverse engineering", a feature in the final dataset is:

  • either the mean probability of belonging to a class (mean computed over all the subsequences), or
  • the number of subsequences whose probabilities of belonging to a class are in a given interval (e.g., [0.8, 1.0]).
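
Under that reading, a column index of the final feature matrix can be mapped back to a description. The helper below is hypothetical (not part of pyts) and assumes a class-major layout with the n_bins histogram counts followed by the mean probability for each class; check pyts' source for the exact ordering before relying on it.

```python
def describe_final_feature(idx, n_bins, n_classes):
    """Map column `idx` of the (n_bins + 1) * n_classes final feature
    matrix to a human-readable description (assumed layout: for each
    class, n_bins histogram counts, then the mean OOB probability)."""
    class_idx, offset = divmod(idx, n_bins + 1)
    if offset == n_bins:
        return f"class {class_idx}: mean OOB probability"
    lo, hi = offset / n_bins, (offset + 1) / n_bins
    return f"class {class_idx}: count of subsequences with proba in [{lo:.2f}, {hi:.2f})"

print(describe_final_feature(3, n_bins=10, n_classes=2))
# class 0: count of subsequences with proba in [0.30, 0.40)
print(describe_final_feature(10, n_bins=10, n_classes=2))
# class 0: mean OOB probability
```

Pairing these descriptions with the second forest's feature_importances_ tells you which histogram bin (or mean) mattered, but not which subsequences produced it.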

The big issue for interpretability is that, with the reduction functions used (mean and histogram), we lose a lot of "spatial" information:

  • With the mean, we lose all the "spatial" information.
  • With the histogram, we lose a lot of "spatial" information: we can retrieve which subsequences fell into which bins, but that's it. If a feature is considered important but corresponds to a bin containing many subsequences, then we can't differentiate them. Another issue with the loss of "spatial" information is that, for a given subinterval, the subsequences extracted from time series A and time series B might have different probabilities and fall into different bins when the histograms are computed. So you cannot really say that this subinterval is important in general, only that specific subsequences (i.e., this interval for these particular time series) are possibly important.

And then, you would need to interpret the first random forest classifier...

Or you can try to use only the first random forest for interpretation (which may be a decent approximation): its features are much easier to interpret (simple statistics extracted from a given interval). You then assume that a feature that is important for classifying subsequences is probably also a good feature for classifying whole time series, and you leave the "aggregation" step out of the interpretation.
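
For that route, the columns of the per-subsequence feature matrix are straightforward to name. The helper below is a hypothetical sketch: it assumes the column order is 3 interval statistics (mean, standard deviation, slope) per interval, followed by the 4 subsequence-level statistics (mean, standard deviation, start index, end index); verify the actual order against pyts' source before trusting the importances.

```python
# Assumed column layout; confirm against pyts' source code.
INTERVAL_STATS = ("mean", "std", "slope")
SUBSEQ_STATS = ("mean", "std", "start index", "end index")

def describe_subsequence_feature(col, n_intervals):
    """Name column `col` of the (n_samples * n_subsequences,
    3 * n_intervals + 4) matrix fed to the first random forest."""
    if col < 3 * n_intervals:
        interval, stat = divmod(col, 3)
        return f"interval {interval}: {INTERVAL_STATS[stat]}"
    return f"subsequence-level: {SUBSEQ_STATS[col - 3 * n_intervals]}"

# With n_intervals = 4 there are 3 * 4 + 4 = 16 columns; pair these
# names with the first forest's feature_importances_ to rank them.
for col in (0, 5, 12, 15):
    print(col, describe_subsequence_feature(col, n_intervals=4))
```

Ranking these named columns by the first forest's feature importances at least points at concrete intervals and statistics, rather than opaque histogram bins.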

Hope this helps you a bit, but I think that TSBF is way too complex for a simple interpretation.