TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis

Home Page:https://stumpy.readthedocs.io/en/latest/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Convert Boolean Arrays to Sets

seanlaw opened this issue · comments

In many functions, we currently pre-compute a T_subseq_isfinite or T_subseq_isconstant. However, as the length of the time series increases, these data structures also increase proportionately in length. For a typically long time series, we'd expect:

  1. T_subseq_isfinite will be mostly filled with True
  2. T_subseq_isconstant will be mostly filled with False

From a memory standpoint, it is probably best to capture/store the minority cases (i.e., T_subseq_isinfinite, note INIFINITE here, and T_subseq_isconstant) as they will take up the least amount of space/memory. The most efficient way to handle this (yet to be tested) is to simply use Python sets.

Here is a trivial example:

def test(T_subseq_isinfinite, T_subseq_isconstant):
    for i in range(100_000_000):
        if i in T_subseq_isinfinite and i in T_subseq_isconstant:
            pass

Sadly, support for using Python sets directly in numba is being deprecated. Though, a typed.List has been added:

from numba import njit
from numba.typed import List

@njit
def foo(x):
    x.append(10)

a = [1, 2, 3]
typed_a = List()
[typed_a.append(x) for x in a]
foo(typed_a)

and typed.Set is expected to be implemented soon but, for now, something like this also works but comes with a deprecation warning:

@njit
def test(T_subseq_isinfinite, T_subseq_isconstant):
    for i in range(100_000_000):
        if i in T_subseq_isinfinite and i in T_subseq_isconstant:
            pass

Once typed.Set is added to numba, then we should be able to save a ton on storage!

Note: This would mean replacing the T_subseq_isfinite with T_subseq_isinfinite in stumpy.mass and stumpy.match (and in other public API) as well as all internal functions and only allowing T_subseq_isconstant be a set. Also, if a function is used, it must return a list where True and it gets converted to a set internally (rather than a NumPy array).

Note: One would need to check if sets are available for cuda!

After reading about how sets are implemented, (hash table) they may not be as space/memory efficient as I anticipated. Instead, we could simply stick to creating T_subseq_isinfinite and T_subseq_isconstant, which should both be short since they contain only the indices of rare subsequences.

According to this SO answer, np.searchsorted is still the fastest assuming the arrays are already pre-sorted (which they should be):

@njit
def contains(a, v):
    return a[np.searchsorted(a, v)] == v

if not contains(T_subseq_isinfinite, idx) and contains(T_subseq_isconstant, idx):
    pass

After consulting with some experts, I realized that even though boolean arrays require one byte per element, this is still very small for 100 million elements and isn't worth optimizing