TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis

Home Page: https://stumpy.readthedocs.io/en/latest/


Some issues in Tutorial_Multidimensional_Motif_Discovery and MDL

NimaSarajpoor opened this issue

I have been reading various documents to better understand MDL, and I came across the tutorial notebook that explains Multidimensional Motif Discovery. I noticed a few issues:

(1) The locations of co-motifs do not match

According to Fig. 2 in Matrix Profile VI, the locations of the motifs in the first two dimensions are the same. Personally, I call these co-motifs, i.e. the motif pair (A, A') in one dimension and (B, B') in another dimension start at the same index. (Also: see Definition 11.)

The toy data provided in the notebook, however, does not result in matching indices for the motifs in the first two dimensions.
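To make this concrete, here is roughly how the motif locations can be checked dimension by dimension (a sketch using the tutorial's toy data; if dimensions 0 and 1 truly contain co-motifs, the printed start indices for those two dimensions should match):

import numpy as np
import pandas as pd
import stumpy

df = pd.read_csv("https://zenodo.org/record/4328047/files/toy.csv?download=1")
m = 30

# Run the 1-D matrix profile on each dimension independently and report
# where each dimension's motif pair starts (and its nearest neighbor).
for col in df.columns:
    mp = stumpy.stump(df[col].values, m)
    motif_idx = np.argmin(mp[:, 0])  # index of the smallest matrix profile value
    print(col, motif_idx, mp[motif_idx, 1])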

(2) When I set normalize=True, everything is good. But if I set the last value of the time series in the last dimension to 1000, I get an inconclusive result when I use MDL.

import numpy as np
import pandas as pd
import stumpy

# df is the toy data used in the tutorial notebook
df = pd.read_csv("https://zenodo.org/record/4328047/files/toy.csv?download=1")
m = 30

df.iloc[-1, 2] = 1000

normalize=True
mps, indices = stumpy.mstump(df, m, normalize=normalize)
motifs_idx = np.argmin(mps, axis=1)
nn_idx = indices[np.arange(len(motifs_idx)), motifs_idx]

mdls, subspaces = stumpy.mdl(df, m, motifs_idx, nn_idx, normalize=normalize)

And this is the plot I see when I visualize the MDL results:

[figure: MDL bit size vs. k, with the minimum at k = 2]

In this case, the minimum is at index 2. However, we know that this is not correct. It is interesting that the elbow still indicates the correct result:
[figure: elbow plot of the matrix profile minima, which still indicates the correct result]
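(For reference, a plot like the ones above can be produced along these lines, continuing from the snippet above; the axis labels are my own:)

import matplotlib.pyplot as plt

plt.plot(np.arange(len(mdls)), mdls, linestyle='--', marker='o')
plt.xlabel('k (zero-based number of dimensions)')
plt.ylabel('Bit Size (MDL)')
plt.show()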

(3) Let's set normalize to False again. Also, let's scale the time series in dimensions 0, 1, and 2 by 1000, 100, and 10, respectively.

# df is toy data
df.iloc[0, :] = df.iloc[0, :] * 1000
df.iloc[1, :] = df.iloc[1, :] * 100
df.iloc[2, :] = df.iloc[2, :] * 10
normalize=True
mps, indices = stumpy.mstump(df, m, normalize=normalize)

motifs_idx = np.argmin(mps, axis=1)
nn_idx = indices[np.arange(len(motifs_idx)), motifs_idx]

And I get this:

>>> motifs_idx
array([ 65, 152, 151])

>>> nn_idx
array([477, 352, 351])

But I was expecting to get the same index for the first two dimensions. In this case, I think the reason is that we are just adding the distances across dimensions. See:
https://github.com/EitanHemed/stumpy/blob/d569c9adbb5f4fd3ba018661a78ac80cbb2d5808/stumpy/core.py#L3999-L4001

While this can make sense when normalize=True, it may not be appropriate to simply add them together (though I understand that we probably have no other choice here). Note that if we apply the matrix profile to each dimension individually, we get the correct answer (still, issue (1) exists). However, if we apply the multi-dimensional matrix profile, we get a strange result because the scales of the time series are not the same, which affects the result when normalize=False.
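A toy illustration of the concern (made-up numbers, not STUMPY internals): when raw, non-normalized distances are combined across dimensions, the largest-scale dimension dominates whatever the other dimensions say:

import numpy as np

# Hypothetical per-dimension distances for two candidate subsequence pairs.
# Dim 0 lives on a scale roughly 1000x larger than the others.
d = np.array([[950.0, 1030.0],   # dim 0 (large scale)
              [  0.9,    0.4],   # dim 1
              [  0.8,    0.3]])  # dim 2

# Averaged across dimensions, the "best" match is decided almost entirely
# by dim 0, even though both small-scale dimensions prefer the second pair.
print(d.mean(axis=0))  # [317.23333333 343.56666667]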

Maybe it is not an issue(?!), but I still expected to get the correct answer since applying the matrix profile to each time series individually reveals the co-motifs in the first two dimensions. So, maybe we should just add a note in the docstring saying that it is better to normalize the WHOLE time series in EACH dimension first before passing it to mstump(..., normalize=False), as sketched below.
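A sketch of that workaround (per-column min-max scaling; z-normalizing each column of the whole series would work similarly):

import pandas as pd
import stumpy

df = pd.read_csv("https://zenodo.org/record/4328047/files/toy.csv?download=1")

# Rescale EACH dimension of the WHOLE time series to [0, 1] first...
df_scaled = (df - df.min()) / (df.max() - df.min())

# ...and only then compute the non-normalized multi-dimensional matrix profile.
mps, indices = stumpy.mstump(df_scaled, 30, normalize=False)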

Also, I am not sure whether it is related to issue (2) mentioned in the previous post, but I feel that doing

https://github.com/EitanHemed/stumpy/blob/d569c9adbb5f4fd3ba018661a78ac80cbb2d5808/stumpy/maamp.py#L275-L281

may not be entirely correct because we are finding the min and max considering ALL values across ALL dimensions. Shouldn't we compute the min/max of each dimension separately?

And this is the plot I see when I visualize the MDL results:

I don't know; I followed what you did and got the same MDL results both times (first without setting the last value to 1000, and then with it):

%matplotlib inline

import pandas as pd
import numpy as np
import stumpy
import matplotlib.pyplot as plt

plt.style.use('https://raw.githubusercontent.com/TDAmeritrade/stumpy/main/docs/stumpy.mplstyle')

df = pd.read_csv("https://zenodo.org/record/4328047/files/toy.csv?download=1")
m = 30
mps, indices = stumpy.mstump(df, m)
motifs_idx = np.argmin(mps, axis=1)
nn_idx = indices[np.arange(len(motifs_idx)), motifs_idx]
mdls, subspaces = stumpy.mdl(df, m, motifs_idx, nn_idx)

print(mdls)

df.iloc[-1, 2] = 1000
mps, indices = stumpy.mstump(df, m)
motifs_idx = np.argmin(mps, axis=1)
nn_idx = indices[np.arange(len(motifs_idx)), motifs_idx]
mdls, subspaces = stumpy.mdl(df, m, motifs_idx, nn_idx)

print(mdls)

and it prints:

array([1519.70685868, 1505.25177862, 1801.94802714])
array([1519.70685868, 1505.25177862, 1801.94802714])

This produces k=1, which is correct.
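(For completeness, k here is just the position of the smallest MDL bit size:)

k = np.argmin(mdls)  # the second entry, 1505.25..., is the smallest
print(k)             # 1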

(3) Let's set normalize to False again

Are you sure? In the code that you provided, normalize=True (not False).

Wait, why are you doing this?

df.iloc[0, :] = df.iloc[0, :] * 1000
df.iloc[1, :] = df.iloc[1, :] * 100
df.iloc[2, :] = df.iloc[2, :] * 10

This takes the first row of all three time series and multiplies it by 1000. Then it takes the second row of all three time series and multiplies it by 100. Finally, it takes the third row of all three time series and multiplies it by 10. Is this what you really want? I would've thought that you wanted to multiply ALL values of the first time series by 1000, and so on and so forth.

Shouldn't we compute the min/max of each dimension separately?

Without having put too much thought into it, I don't think so. If you apply min/max to each dimension separately, then you'd be discretizing each dimension using completely independent functions. I believe that the same discretization function must be applied uniformly across all dimensions using the same max/min.
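A quick numerical illustration of that point (made-up data, not STUMPY code): with independent per-dimension min/max, the same raw value is mapped to very different integer codes in different dimensions, so the resulting bit counts are no longer comparable:

import numpy as np

T = np.array([[0.0, 5.0, 10.0],    # dim 0, range [0, 10]
              [0.0, 5.0, 100.0]])  # dim 1, range [0, 100]
n_bit = 8

# Discretizing each dimension with its own min/max: the raw value 5.0
# becomes code 128 in dim 0 but only code 13 in dim 1.
T_min = T.min(axis=1, keepdims=True)
T_max = T.max(axis=1, keepdims=True)
codes = np.round((T - T_min) / (T_max - T_min) * (2**n_bit - 1)).astype(np.int64)
print(codes)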

@seanlaw

I don't know; I followed what you did and got the same MDL results both times (first without setting the last value to 1000, and then with it):

Oops, my bad! The title of the second part should have been:

When I set normalize=False, everything is good. But....

So, I meant when normalize is False AND the last element of the time series in the last dimension is set to 1000. Would you mind trying this?

import numpy as np
import pandas as pd

import stumpy

df = pd.read_csv("https://zenodo.org/record/4328047/files/toy.csv?download=1")
df.iloc[-1, 2] = 1000

normalize=False
m = 30

mps, indices = stumpy.mstump(df, m, normalize=normalize)
motifs_idx = np.argmin(mps, axis=1)
nn_idx = indices[np.arange(len(motifs_idx)), motifs_idx]

mdls, subspaces = stumpy.mdl(df, m, motifs_idx, nn_idx, normalize=normalize)
proper_k = np.argmin(mdls)

print(f'proper_k --> {proper_k}')
print(f'dims --> {subspaces[proper_k]}')

Are you sure? In the code that you provided, normalize=True (not False).
Wait, why are you doing this?

Again... my bad! That should have been False, and I meant:

df.iloc[:, 0] = 1000 * df.iloc[:, 0]
df.iloc[:, 1] = 100 * df.iloc[:, 1]
df.iloc[:, 2] = 10 * df.iloc[:, 2]

Would you mind trying this?

import numpy as np
import pandas as pd

import stumpy

df = pd.read_csv("https://zenodo.org/record/4328047/files/toy.csv?download=1")
df.iloc[:, 0] = df.iloc[:, 0] * 1000
df.iloc[:, 1] = df.iloc[:, 1] * 100
df.iloc[:, 2] = df.iloc[:, 2] * 10

normalize=False
m = 30

mps, indices = stumpy.mstump(df, m, normalize=normalize)
motifs_idx = np.argmin(mps, axis=1)
nn_idx = indices[np.arange(len(motifs_idx)), motifs_idx]


print(f'motifs_idx --> {motifs_idx}')
print(f'nn_idx --> {nn_idx}')

mdls, subspaces = stumpy.mdl(df, m, motifs_idx, nn_idx, normalize=normalize)
proper_k = np.argmin(mdls)

print(f'proper_k --> {proper_k}')
print(f'dims --> {subspaces[proper_k]}')

@seanlaw
Next time, I will run the code end-to-end on my end and provide the full script to avoid such mistakes. Apologies for the inconvenience.

So, I meant when normalize is False AND the last element of the time series in the last dimension is set to 1000. Would you mind trying this?

Okay, I am able to reproduce it now. In both cases, I think the issue stems from the fact that one or more of the time series have a significantly larger/different min/max range, which then affects the MDL modeling. Essentially, in our current implementation, we are basically assuming that the data from each time series are being sampled from the same distribution (e.g., all of the time series come from three different thermostats that are sitting in the same room). However, it's possible that you have three time series that are collecting values from different distributions (e.g., all three time series are in the same room, but one is measuring temperature, a second is measuring pressure, and a third is measuring the amount of CO2 gas). In the former case, it's likely "okay" to use the default discretization function. However, in the latter case, it might not make any sense and, instead, the user should specify their own discretization function (via discretize_func in stumpy.mdl).
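For the latter case, a custom function might, for example, discretize using the known physical range of each sensor rather than the observed global min/max (a rough sketch only; KNOWN_MIN/KNOWN_MAX are hypothetical placeholders, and the exact callable signature that stumpy.mdl expects for discretize_func should be checked against its docstring):

import numpy as np

KNOWN_MIN, KNOWN_MAX = 0.0, 1000.0  # hypothetical, domain-informed sensor range
n_bit = 8

def my_discretize(a):
    # Map values onto 2**n_bit - 1 integer levels using a fixed,
    # domain-informed range instead of the data's global min/max.
    return np.round((a - KNOWN_MIN) / (KNOWN_MAX - KNOWN_MIN) * (2**n_bit - 1)).astype(np.int64)

# df, m, motifs_idx, and nn_idx as in the script above
mdls, subspaces = stumpy.mdl(df, m, motifs_idx, nn_idx, normalize=False, discretize_func=my_discretize)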

Perhaps your question is whether or not there is a "smarter" way to either warn the user that the default discretization function may be bad/insufficient for their data and/or whether there is a better default discretization function?

Shouldn't we compute the min/max of each dimension separately?

Yes and no. It depends. If the scales of all of the time series are unrelated, then "yes". If the scales of all of the time series are related (as in the former case above), then "no".

I think the reason is that we are just adding the distances across dimensions

Again, this is fine in the former case above but not fine for the latter case. When normalize=False and you have the latter case above, what might be "better" than adding the distances together? Without making too many assumptions, I don't know 😢. Do you?

We are basically assuming that the data from each time series are being sampled from the same distribution (e.g., all of the time series come from three different thermostats that are sitting in the same room).

Yes... I noticed it after playing with the data and checking the results.

Perhaps your question is whether or not there is a "smarter" way to either warn the user that the default discretization function may be bad/insufficient for their data and/or whether there is a better default discretization function?

Yeah, and I was hoping it would help me find a solution for #942. My main reason for creating this issue was to dig a little deeper and see whether it is possible to consider a customized offset T_X_add for the subsequences (as discussed in #942) and discretize them properly.

I think the reason is that we are just adding the distances across dimensions
Again, this is fine in the former case above but not fine for the latter case. When normalize=False and you have the latter case above, what might be "better" than adding the distances together? Without making too many assumptions, I don't know 😢. Do you?

You are right. Sadly I don't know either.