TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis

Home Page:https://stumpy.readthedocs.io/en/latest/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Ignore_trivial does not work as expected

frankl1 opened this issue Β· comments

Hi,

I am trying stumpy, I really love the library by the way.

I have the impression that the parameter ignore_trivial does not work as expected. Here is a minimum example.

import stumpy
steam_df = pd.read_csv("https://zenodo.org/record/4273921/files/STUMPY_Basics_steamgen.csv?download=1")
stumpy.stump(steam_df['water level'], m, ignore_trivial=True)

I expect this code not to exclude trivial matches, and therefore the mp should be 0 everywhere as this is a self-join. However, the result is the following:
image

Even more suprising, the result is the same if I set ignore trivial to False.
image

Clearly, something is not working as expected.

I am using Stumpy 1.9.2

Thanks

@frankl1 Thank you for your question and for your kind words. Here are a few observations as I work through the example:

  1. By default, ignore_trivial is automatically set to True for self-joins (i.e., if you don't supply a second time series, T_B, as an input parameter)
  2. For self-joins, ignore_trivial should ALWAYS be True. Otherwise, every subsequence will find itself, called the "self-match"/"trivial-match" (or some local neighboring subsequence next to the self-match) as its "one-nearest-neighbor".
  3. When ignore_trivial=True this does NOT mean "not to exclude trivial matches". In fact, ignore_trivial=True means to "disallow the nearest neighbor to exist within each subsequence's "exclusion zone" (usually within +/- m/2 surrounding the subsequence of interest).
  4. Overriding or disabling ignore_trivial for self-joins (by setting ignore_trivial=False) is not advised and you are running into dangerous territory that is not supported (i.e., you are on your own if you decide to go down this path). According to the original authors of the matrix profile, the "self-match"/"trivial-match" is meaningless/obvious since you know that a subsequence should match itself and you don't need to compute a matrix profile to know this. Currently, there is no way to get around this for a self-join (not even by setting from stumpy import config; config.STUMPY_EXCL_ZONE_DENOM=np.inf; as this will allow any subsequence to match except for the self-match)
  5. STUMPY is already at version 1.12.0 and so I recommend updating if that is feasible (there have been some bug fixes)

Having said all of that, there will be extreme/exceptionally rare circumstances where one might want/need to allow self-match/trivial-match and here is a (bad, very bad, don't do this as this is not supported) solution:

import stumpy
import pandas as pd
steam_df = pd.read_csv("https://zenodo.org/record/4273921/files/STUMPY_Basics_steamgen.csv?download=1")
m = 640

stumpy.stump(steam_df['water level'], m, steam_df['water level'], ignore_trivial=False)

which gives:

array([[0.0, 0, -1, -1],
       [0.0, 1, -1, -1],
       [0.0, 2, -1, -1],
       ...,
       [1.2502775766493696e-06, 8958, -1, -1],
       [1.5542989743239043e-06, 8959, -1, -1],
       [1.1920928955078125e-06, 8960, -1, -1]], dtype=object)

So, all of the distances are zero (please mind the limited precision) and the index for the nearest neighbor is the self-match for all subsequences. Essentially, we are "tricking" stump into thinking that we are performing an AB-join, where exclusion zones are not used at all. Of course, you may be wondering, "couldn't we have omitted the ignore_trivial=False and simply done":

stumpy.stump(steam_df['water level'], m, steam_df['water level'])

And the answer is "no" because, behind the scenes, we have implemented some checks that compare whether T_A and T_B are "equal" and, if "yes", then this automatically implies a self-join (not an AB-join) and then exclusion zones are applied once again (i.e., against your will). This is really there to prevent (new??) users from shooting themselves in the foot (by playing a silly trick). So, we purposely force the user to jump through a few hoops if they truly want to override the self-join behavior (via being "explicit over implicit"). At the end of the day, the goal of STUMPY is to faithfully reproduce the original matrix profile publications/work and they have advised that an exclusion zone of +/- m/2 is "good enough" for a large majority of use cases and so that is what we've chosen as well.

Clearly, something is not working as expected.

Unfortunately/fortunately, this is the expected behavior. To reiterate, self-matches/trivial-matches for self-joins are avoided unless the user tricks stump. In that case, the user is on their own πŸ˜„. Let me know if that makes sense or if you have any follow up questions and please proceed with caution.

@seanlaw what a clear and nice to read reply. Thank you very much for all the clarification.

I was interpreting the meaning of the parameter in the wrong way. Allow me to say that the document is not clear about the meaning ignore_trivial, I get it well with your explanation.

Regarding self-join with ignore_trivial = False, I was doing it as unit test just to see if the output is correct for this corner case.

Clearly, everything is working as expected πŸ˜„ , however, I thing this behaviour is misleading (it led me to opening this issue πŸ˜„), even for new users. If the parameter is overwritten under the hood, it would be useful to show a warning to the user. Something like "ignore_trivial is set back to True as the two time series are the same." Even better, I would actually return the correct output with 0 everywhere for the first NN, and whatever is the distance for the 2, 3, ... NN if k>1. In that way I can use stumpy can be use to compare multiple time series and find out which one are the same and which are not, this is very important in many tasks such time classification and clustering. I think it would be clearer if the output of the function is strictly aligned with the input parameters.

Additionally, I will suggest to add the exclusion zone as parameter that the user could change and use +/- m/2 only when this parameter is not set (i.e None). This will be very useful for AB-join, but also for self-joint in a bunch of applications.

Once more time, thank you for the quick and in-depth reply.

Allow me to say that the document is not clear about the meaning ignore_trivial, I get it well with your explanation.

Your feedback is well received. I've opened a new issue to track this but a PR is also welcome πŸ˜„

If the parameter is overwritten under the hood, it would be useful to show a warning to the user.

I have misspoken. If something is in fact changed underneath the hood, a warning is presented. So, both:

stumpy.stump(steam_df['water level'], m, steam_df['water level'])
stumpy.stump(steam_df['water level'], m)

produce the exact same results (ignore_trivial=True) and nothing is changed behind the scenes. The only times we change the behavior is when:

  1. you only provide a single time series T_A and ignore_trivial is always forcefully set to True since it can only be a self-join and there ignoring the trivial/self-match should not be allowed. Currently, there is no warning but we plan to add it (see issue #930).
  2. you have an AB-join AND ignore_trivial=True, which makes no sense. In this case, ignore_trivial is forcefully set to False (since there is no interpretable meaning of an "exclusion zone" when there are two unrelated time series) AND a warning is presented.

See this code for more details:

stumpy/stumpy/core.py

Lines 3638 to 3692 in 58ccc69

def check_ignore_trivial(T_A, T_B, ignore_trivial):
"""
Check inputs and verify the appropriateness for self-joins vs AB-joins and
provides relevant warnings.
Note that the warnings will output the first occurrence of matching warnings
for each location (module + line number) where the warning is issued
Parameters
----------
T_A : numpy.ndarray
The time series or sequence for which to compute the matrix profile
T_B : numpy.ndarray
The time series or sequence that will be used to annotate T_A. For every
subsequence in T_A, its nearest neighbor in T_B will be recorded. Default is
`None` which corresponds to a self-join.
ignore_trivial : bool
Set to `True` if this is a self-join. Otherwise, for AB-join, set this
to `False`.
Returns
-------
ignore_trivial : bool
The (corrected) ignore_trivial value
Notes
-----
These warnings may be supresse by using a context manager
```
import stumpy
import numpy as np
import warnings
T = np.random.rand(10_000)
m = 50
with warnings.catch_warnings():
warnings.filterwarnings("ignore", message="Arrays T_A, T_B are equal")
for _ in range(5):
stumpy.stump(T, m, T, ignore_trivial=False)
```
"""
if ignore_trivial is False and are_arrays_equal(T_A, T_B): # pragma: no cover
msg = "Arrays T_A, T_B are equal, which implies a self-join. "
msg += "Try setting `ignore_trivial = True`."
warnings.warn(msg)
if ignore_trivial and are_arrays_equal(T_A, T_B) is False: # pragma: no cover
msg = "Arrays T_A, T_B are not equal, which implies an AB-join. "
msg += "`ignore_trivial` has been automatically set to `False`."
warnings.warn(msg)
ignore_trivial = False
return ignore_trivial

Note that version 1.9.2 may be too outdated and these warnings/changes may not exist. Please try the latest version to confirm.

Even better, I would actually return the correct output with 0 everywhere for the first NN, and whatever is the distance for the 2, 3, ... NN if k>1. In that way I can use stumpy can be use to compare multiple time series and find out which one are the same and which are not, this is very important in many tasks such time classification and clustering. I think it would be clearer if the output of the function is strictly aligned with the input parameters.

Hmmm, maybe I'm not understanding your point. While it may be true that two identical time series would have identical top-k matrix profiles, it seems like a really roundabout way to confirm that the two time series are the same. Instead, one could/should do something like:

stumpy/stumpy/core.py

Lines 471 to 498 in 58ccc69

def are_arrays_equal(a, b): # pragma: no cover
"""
Check if two arrays are equal; first by comparing memory addresses,
and secondly by their values.
Parameters
----------
a : numpy.ndarray
NumPy array
b : numpy.ndarray
NumPy array
Returns
-------
output : bool
This is `True` if the arrays are equal and `False` otherwise.
"""
if id(a) == id(b):
return True
# For numpy >= 1.19
# return np.array_equal(a, b, equal_nan=True)
if a.shape != b.shape:
return False
return bool(((a == b) | (np.isnan(a) & np.isnan(b))).all())

Note that this check is much, much cheaper than computing the matrix profile, especially as the length of your time series gets larger.

Additionally, I will suggest to add the exclusion zone as parameter

A long, long time ago, the exclusion zone was a function parameter. However, rather than have it be a local function parameter, we decided to move this to a global configuration parameter via:

import stumpy
from stumpy import config

config.STUMPY_EXCL_ZONE_DENOM = 8  # This is a global config parameter that is seen by all STUMPY functions

And then, in ALL functions where the exclusion zone is relevant, the exclusion zone would be computed as:

excl_zone = int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM))

The reason why we decided to do it this way was because, you guessed it, we wanted to prevent the user from shooting themselves in the foot. One of the most natural things that one might do is:

  1. Compute the matrix profile using stumpy.stump
  2. Find motifs from the matrix profile using stumpy.motifs

Originally (i.e., a long, long time ago), you might have done:

mp = stumpy.stump(T, m, excl_zone=8)  # This no longer works 
motif_distances, motif_indices = stumpy.motifs(T, mp[:, 0])  # Notice that no exclusion zone is specified so a default is used

However, by omitting excl_zone when stumpy.motifs was called, the motifs are now found with a completely different exclusion zone (since it is not specified, the default m/2 is used) than the one used to compute the matrix profile! The motif results would be complete and utter nonsense! Thus, we decided to circumvent this by enforcing any changes to the exclusion zone to be as explicit as possible. So now, to change the exclusion zone, one must be explicit and do:

import stumpy
from stumpy import config

m = 32
config.STUMPY_EXCL_ZONE_DENOM = 4  # This makes the exclusion zone exactly and explicitly 8

mp = stumpy.stump(T, m) 
motif_distances, motif_indices = stumpy.motifs(T, mp[:, 0])  

And all functions are "aware" and will use the same exclusion zone accordingly. 99.9% of the time, users will never change the exclusion zone. For the minority of cases, changing the exclusion zone should be a very explicit/deliberate action. We would rather users ask than spend all of our time repeatedly trying to help users troubleshoot a forgotten excl_zone parameter. Since we are a very small community of volunteers, we decided to minimize this kind of troubleshooting and instead, focus on making it straightforward (via sane defaults) for the vast majority of use cases. At the end of the day, it is a tradeoff that we have made.

Thanks for the reply,

Note that version 1.9.2 may be too outdated and these warnings/changes may not exist. Please try the latest version to confirm.

For some reasons, conda was not picking the latest stumpy during installation. I guess it was because of compatibility with existing packages in my environment. So, I created a fresh environment and installed the latest version (1.12.0).

I can see some differences, I have warnings...from time to time:

Running mp = stumpy.stump(steam_df['steam flow'], m, ignore_trivial=False) does not give a warning. But the output is not 0, which is the expected behavior given you explanation. Only the warning is missing here.

But running mp = stumpy.stump(steam_df['steam flow'], m, steam_df['steam flow'], ignore_trivial=False) give me the warning UserWarning: Arrays T_A, T_B are equal, which implies a self-join. Try setting ignore_trivial = True. and the resulting mp is 0 which makes sense.

However, subsequent executions of the same code do not give the warning anymore. I have to restart my Jupiter notebook for the warning to show up.

Don't you think mp = stumpy.stump(steam_df['steam flow'], m, ignore_trivial=False) and mp = stumpy.stump(steam_df['steam flow'], m, steam_df['steam flow'], ignore_trivial=False) should behave exactly the same ?

Hmmm, maybe I'm not understanding your point. While it may be true that two identical time series would have identical top-k matrix profiles, it seems like a really roundabout way to confirm that the two time series are the same.
I think you get me right and I totally agree with you, my comment is mostly about what one would expect when using the library. And not all the users will read the matrixprofile papers before using stumpy. Personally, I would expect the following behavior:

def stump(tA, m, tB):
    if arrays_are_equal(tA, tB):
        mp = np.empty((len(tA), 4))
        mp[:, 0] = 0
        mp[:, 1] = np.arange(len(tA))
        mp[:, 2:] = -1
        return mp
   return _stump(tA, m, tB) 

This way, the warning becomes optional, and there is no need to overwrite the trivial matching parameter. Again, I am new to this library, so I don't know all the constraints to take into account. Let me know if I am missing something here.

Your explanation on the exclusion is clear, I am happy that there is a way to change it.

Originally (i.e., a long, long time ago), you might have done:

mp = stumpy.stump(T, m, excl_zone=8)  # This no longer works 
motif_distances, motif_indices = stumpy.motifs(T, mp[:, 0])  # Notice that no exclusion zone is specified so a default is used

I might be missing something here, stumpy.stump() already returns the distances and the indices. So why does stumpy.motifs() still need the exclusion zone? Isn't the following code doing the same:

mp = stumpy.stump(T, m, excl_zone=8)
motif_indices= np.argsort(mp[:, 0])[:k] # top k motifs
motif_distances = mp[motif_idx, 0]

Forget about the last point, I got the answer from the documentation.

image

I can see some differences, I have warnings...from time to time:
However, subsequent executions of the same code do not give the warning anymore. I have to restart my Jupiter notebook for the warning to show up.

Yes, unlike Python logging, Python warnings are only set up to warn you the first time something happens rather than repeatedly spamming you. It's a blessing and a curse but warnings can be easily caught/ignored programmatically while logging is difficult to "catch". When you restart the Jupyter notebook this essentially clears the cache. You'll notice that this is similar to or consistent with how some warnings are handled in numpy, sklearn, or pandas.

Don't you think mp = stumpy.stump(steam_df['steam flow'], m, ignore_trivial=False) and mp = stumpy.stump(steam_df['steam flow'], m, steam_df['steam flow'], ignore_trivial=False) should behave exactly the same ?

Maybe?! I can see it both ways. Purely from an API standpoint (and ignoring the fact that the time series are the same), the first is a self-join and the latter is an AB-join, which should have different expectations. While I understand your rationale, IMHO, I think 99.9% of users will have no reason to do either of these things. I don't mean to come across as argumentative/stubborn but given that this is a super edge case, I don't think there is perfect solution. I'll try to spend some time to think about it.

This way, the warning becomes optional, and there is no need to overwrite the trivial matching parameter.

I will have to think about this more. It might be doable

Thanks @seanlaw ,

I agree that this is an edge case, and only few users would need to do this.

I consider my initial query resolved or will be with #930 , so feel free to close this issue.