brandtbucher / automap

High-performance autoincremented integer-valued mappings. 🗺️

Multiple NaNs permitted in an AutoMap if provided via np.ndarray

flexatone opened this issue · comments

While AutoMap correctly rejects initialization from a tuple of two NaNs, initialization from an np.ndarray of NaNs succeeds.

>>> automap.AutoMap((np.nan, np.nan))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
ValueError: nan

>>> automap.AutoMap(np.array((np.nan, np.nan)))
automap.AutoMap([nan, nan])

These observations may relate:

>>> frozenset((np.nan, 1.)) | frozenset((np.nan, 1.))
frozenset({nan, 1.0})

>>> frozenset(np.array((np.nan, 1.))) | frozenset(np.array((np.nan, 1.)))
frozenset({nan, 1.0, nan})

>>> frozenset(np.array((np.nan, 1.)).tolist()) | frozenset(np.array((np.nan, 1.)).tolist())
frozenset({nan, 1.0, nan})

This comes back to the issue of "same" vs "different" NaNs.

>>> a, b = float('nan'), float('nan')
>>> a is b
False

As you observed, automap (like all native containers in Python) has two ways of verifying containment: elementwise identity, and elementwise equality.

Obviously for NaNs, equality is out, which leaves us with identity as the only way to compare two NaNs. This, as we've seen, behaves differently depending on the exact objects being compared:

>>> a in [a]
True
>>> a in {a}
True
>>> a in {a: None}
True
>>> a in [b]
False
>>> a in {b}
False
>>> a in {b: None}
False

While unintuitive, there are good reasons why NaN isn't a singleton object (even though np.nan is often treated as this "singleton", and sometimes works like one). There are also other good reasons to use the identity shortcut here.
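To illustrate why treating np.nan as a singleton is fragile: np.nan is just one module-level float object, so identity checks against it can pass, but any operation that rebuilds the float (a pickle round-trip, for example) yields a new, non-identical NaN. A minimal sketch:

```python
import pickle

import numpy as np

# np.nan is a single module-level Python float, so comparing it to itself
# by identity succeeds:
assert np.nan is np.nan

# But rebuilding the value by any round-trip produces a fresh float object,
# so identity no longer holds:
restored = pickle.loads(pickle.dumps(np.nan))
assert restored is not np.nan

# And two independently constructed NaNs are always distinct objects:
assert float("nan") is not float("nan")
```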


I cannot imagine a legitimate use-case where NaN is an actual key into a Series or Frame... I think this discovery was mostly just an oddity arising from weird behavior elsewhere, right? Without actual use-cases (and good reasons why our behavior should differ from dict for them), it seems foolish to me to hardcode special cases for this in critical sections of our hash-table lookup code. The real use-cases will almost certainly suffer, without even considering the questions of complex NaNs, signaling NaNs, negative NaNs, etc...

However, since static_frame is probably our only client, I might be okay bending if it's absolutely necessary.

Thanks for these comments. As the behavior of AutoMap is the same as built-in containers, I agree that at this time we can avoid special handling. SF uses AutoMap to determine uniqueness, but for now we will just have to accept that you can have multiple NaNs and still have unique values.

What I do not entirely understand, however, is why putting the np.nan in an array changes the behavior; i.e., why do we get different results loading an AutoMap via tuple versus an array?

It appears that NumPy doesn't round-trip the np.nan "singleton" when it's converted to a native floating-point type:

>>> np.nan is np.array([np.nan]).tolist()[0]
False

When stored as an object, though, it behaves as expected:

>>> np.nan is np.array([np.nan], object).tolist()[0]
True

The same holds for just iterating over NumPy float values (which is what automap does). Each new NaN is a unique object:

>>> [id(n) for n in np.array([np.nan, np.nan, np.nan])]
[139675351170512, 139674883707792, 139674883708112]
>>> [id(n) for n in np.array([np.nan, np.nan, np.nan], object)]
[139675324443344, 139675324443344, 139675324443344]
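The same boxing behavior shows up with plain indexing, not just iteration. A float64 array stores raw machine doubles, so each element access constructs a fresh np.float64 scalar object, while an object array stores references and hands back the same object every time. A small sketch of this distinction:

```python
import numpy as np

a = np.array([np.nan])
# Indexing a float64 array boxes a new scalar object on every access:
assert a[0] is not a[0]

o = np.array([np.nan], dtype=object)
# An object array stores PyObject references, so the identical object
# is returned each time:
assert o[0] is o[0]
```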

I did some performance tests to confirm an observation that loading these multiple instances of NaN into mappings (both AutoMap and dict) massively degrades performance compared to other same-sized iterables. While I understand that == is always False for NaNs and that identity must be used to compare them, it is not yet intuitive to me why loading iterables of NaN instances is so much less performant. Do you have any thoughts on why this is?

In [11]: a1 = np.arange(10000)

In [12]: a2 = np.full(10000, np.nan)

In [13]: %timeit automap.AutoMap(a1)
557 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [14]: %timeit automap.AutoMap(a2)
1.57 s ± 58.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit dict.fromkeys(a1)
757 µs ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [16]: %timeit dict.fromkeys(a2)
1.43 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Hash collisions. All NaNs hash to 0.

Ah, so this is the worst-case scenario for hash collisions!
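The effect can be sketched directly. On the CPython versions in use at the time of this issue (pre-3.10), every NaN hashes to 0, so while equality keeps each NaN a distinct key, they all land in the same hash bucket: each insert must probe past every previous entry, turning a linear build into a quadratic one. (CPython 3.10 later changed NaN hashing to derive from the object's id, which avoids this pathology.)

```python
# A minimal sketch of the worst case: many distinct keys sharing one hash.
# Equality between NaNs is always False, so each is kept as a separate key...
nans = [float("nan") for _ in range(1000)]
d = dict.fromkeys(nans)
assert len(d) == 1000

# ...and with a shared hash value (0 on pre-3.10 CPython), every insertion
# collides with all prior entries, so the table degrades to linear probing.
print(hash(nans[0]))  # 0 on CPython < 3.10; id-derived on 3.10+
```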