Convert Boolean Arrays to Sets
seanlaw opened this issue · comments
In many functions, we currently pre-compute a T_subseq_isfinite or T_subseq_isconstant array. However, as the length of the time series increases, these data structures grow proportionately in length. For a typical long time series, we'd expect:
T_subseq_isfinite will be mostly filled with True
T_subseq_isconstant will be mostly filled with False
From a memory standpoint, it is probably best to capture/store only the minority cases (i.e., T_subseq_isinfinite, note INFINITE here, and T_subseq_isconstant), as they will take up the least amount of space/memory. The most efficient way to handle this (yet to be tested) may be to simply use Python sets.
Here is a trivial example:
def test(T_subseq_isinfinite, T_subseq_isconstant):
    for i in range(100_000_000):
        if i in T_subseq_isinfinite and i in T_subseq_isconstant:
            pass
Sadly, support for using Python sets directly in numba is being deprecated, though a typed.List has been added:
from numba import njit
from numba.typed import List

@njit
def foo(x):
    x.append(10)

a = [1, 2, 3]
typed_a = List()
for x in a:  # copy the Python list into a numba typed list
    typed_a.append(x)
foo(typed_a)
and typed.Set is expected to be implemented soon. For now, something like this also works, but it comes with a deprecation warning:
@njit
def test(T_subseq_isinfinite, T_subseq_isconstant):
    for i in range(100_000_000):
        if i in T_subseq_isinfinite and i in T_subseq_isconstant:
            pass
Once typed.Set is added to numba, we should be able to save a ton on storage!
Note: This would mean replacing T_subseq_isfinite with T_subseq_isinfinite in stumpy.mass and stumpy.match (and in other public API) as well as in all internal functions, and only allowing T_subseq_isconstant to be a set. Also, if a function is used, it must return a list of the indices where the condition is True, which then gets converted to a set internally (rather than a NumPy array).
Note: One would need to check if sets are available for cuda!
After reading about how sets are implemented (as hash tables), I realized they may not be as space/memory efficient as I anticipated. Instead, we could simply stick to creating T_subseq_isinfinite and T_subseq_isconstant as sorted index arrays, which should both be short since they contain only the indices of rare subsequences.
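As a rough sketch of that idea, assuming the existing boolean arrays are plain NumPy arrays: np.flatnonzero returns the sorted positions of the True entries, so the minority cases can be derived directly from what is already computed. The array length, the positions of the rare subsequences, and the name T_subseq_isconstant_idx are all made up for illustration. Note that sys.getsizeof only reports the set's own hash table in CPython (not the integer objects it references), so the set figure is a lower bound.

```python
import sys
import numpy as np

n = 1_000_000  # hypothetical number of subsequences

# Toy boolean arrays: almost everything is finite, almost nothing is constant
T_subseq_isfinite = np.ones(n, dtype=bool)
T_subseq_isfinite[[10, 500, 99_999]] = False
T_subseq_isconstant = np.zeros(n, dtype=bool)
T_subseq_isconstant[[42, 500]] = True

# Keep only the minority cases, as sorted index arrays
T_subseq_isinfinite = np.flatnonzero(~T_subseq_isfinite)
T_subseq_isconstant_idx = np.flatnonzero(T_subseq_isconstant)  # hypothetical name

# Footprint comparison for 1,000 rare indices out of n
rare = np.arange(0, n, 1000, dtype=np.int64)
bool_bytes = T_subseq_isfinite.nbytes        # one byte per subsequence
idx_bytes = rare.nbytes                      # eight bytes per rare index
set_bytes = sys.getsizeof(set(rare.tolist()))  # hash table only (lower bound)

print(bool_bytes, idx_bytes, set_bytes)
```

The index array is the smallest of the three here, and the set's hash table alone is already larger than the index array holding the same values.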
According to this SO answer, np.searchsorted is still the fastest, assuming the arrays are already sorted (which they should be):
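import numpy as np
from numba import njit

@njit
def contains(a, v):
    # np.searchsorted returns len(a) when v exceeds every element, so guard the lookup
    i = np.searchsorted(a, v)
    return i < len(a) and a[i] == v

if not contains(T_subseq_isinfinite, idx) and contains(T_subseq_isconstant, idx):
    pass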
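If many indices need to be checked at once, the same lookup can be vectorized. The following is a minimal NumPy-only sketch (no numba), assuming both index arrays are sorted and one-dimensional; the helper name contains_many and the sample values are hypothetical. It clips the insertion position before indexing, because np.searchsorted returns len(a) for values larger than every element, and it treats an empty index array as containing nothing.

```python
import numpy as np

def contains_many(a, v):
    """Return a boolean array marking which entries of `v` occur in sorted 1-D `a`."""
    a = np.asarray(a)
    v = np.asarray(v)
    if a.size == 0:
        return np.zeros(v.shape, dtype=bool)
    pos = np.searchsorted(a, v)        # insertion points, may equal a.size
    in_range = pos < a.size            # False where v is beyond the last element
    pos = np.minimum(pos, a.size - 1)  # clip so that the lookup below is safe
    return in_range & (a[pos] == v)

# Made-up sorted index arrays of rare subsequences
T_subseq_isinfinite = np.array([10, 500, 99_999])
T_subseq_isconstant = np.array([42, 500])

idx = np.array([0, 42, 500, 1_000_000])
is_inf = contains_many(T_subseq_isinfinite, idx)
is_const = contains_many(T_subseq_isconstant, idx)

# Same condition as the scalar check: not infinite and constant
finite_and_constant = ~is_inf & is_const
print(finite_and_constant)
```

Only index 42 satisfies the condition in this toy data: 500 is constant but also flagged as infinite, and the other two indices are in neither array.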
After consulting with some experts, I realized that even though boolean arrays require one byte per element, this is still very small for 100 million elements (about 100 MB) and isn't worth optimizing.
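For reference, the arithmetic behind that conclusion can be checked without allocating anything. This is only a back-of-the-envelope sketch; the variable names are illustrative.

```python
import numpy as np

n = 100_000_000  # number of elements discussed above

bytes_per_element = np.dtype(np.bool_).itemsize  # NumPy booleans take one byte each
total_bytes = n * bytes_per_element

total_mb = total_bytes / 1e6      # decimal megabytes
total_mib = total_bytes / 2**20   # binary mebibytes

print(f"{total_mb:.1f} MB ({total_mib:.1f} MiB) per boolean array")
```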