uqfoundation / dill

serialize all of Python

Home Page: http://dill.rtfd.io

nan type drift for np.nan

gatorwatt opened this issue

This issue appears to be common to dill and pickle: when a dictionary populated with np.nan entries is serialized and recovered, the entries come back as new nan objects, as if they had been created with float("nan"). Normally this shouldn't be a big deal, as both adhere to IEEE 754, except that Python dictionaries have an edge case / incongruity between the two when nan is used as a key. Specifically, for the dictionary a_dict = {np.nan: 1234} one can access the value with a_dict[np.nan], but for the dictionary b_dict = {float("nan"): 4321}, attempting to access b_dict[float("nan")] raises a KeyError, and similarly for other dictionary methods like .pop().

Ideally, a serialized dictionary would retain its nan type on recovery, so that any peculiarities of this nature are preserved.
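
The drift described above can be reproduced with the standard library alone. The following is a minimal sketch, not code from the original report: it assumes a single module-level float nan (the name `nan` is illustrative) as a stand-in for np.nan and uses pickle rather than dill.

```python
import pickle

# Stand-in for np.nan: one shared nan object that is reused as a dictionary key.
nan = float("nan")
original = {nan: 1234}

# Round trip through pickle; the float key is rebuilt as a new object on load.
restored = pickle.loads(pickle.dumps(original))
restored_key = next(iter(restored))

print(original[nan])           # 1234
print(restored_key is nan)     # False: the key is a different nan object
print(nan in restored)         # False: lookup with the original nan now fails
print(restored[restored_key])  # 1234: the value is still there under the new key
```

The value survives the round trip, but it can only be reached through the key object that the load created.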

Python 3.8.18 (default, Aug 25 2023, 04:23:37) 
[Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> d = {np.nan: 1234, float('nan'): 4321}
>>> d
{nan: 1234, nan: 4321}
>>> d[np.nan]
1234
>>> d[float('nan')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan
>>> 

...and you are saying that a pickled np.nan can get converted to float('nan') after a dump then load? In some cases, or always?

Yeah, in my experience this is pretty universal for dill / pickle, even in different import scenarios. I did a little digging, and it appears that np.nan refers to a single global object, so the root of the matter is that np.nan is np.nan evaluates to True, while for generic floats float("nan") is float("nan") evaluates to False. I believe this disparity is the source of several other Python edge cases, like using nan as a key in a dictionary (lookup works with np.nan but not with a fresh float("nan")).

After thinking about it, for my specific use case in the Automunge library I decided to remove exposure to the nan dictionary key scenario and use None in place of nan, so if you want to close this issue, I think my concern is resolved by that update.
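
To illustrate the identity point, here is a small sketch using only built-in floats (the names `a`, `b`, and `d` are illustrative). Dictionary lookup and the `in` operator treat a key as matching when it is the same object or compares equal, and a nan never compares equal to anything, so only the identity shortcut can succeed.

```python
a = float("nan")
b = float("nan")

print(a == a)    # False: nan is never equal to anything, including itself
print(a is b)    # False: two separate calls give two separate objects

d = {a: 1234}
print(d[a])      # 1234: same object, so the identity check succeeds
print(b in d)    # False: different object, and the equality check fails
```

This is why a shared object such as np.nan works as a key, while a newly created float("nan") does not find an existing nan key.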

Thanks
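
The workaround mentioned above could look roughly like the following. This is only a sketch of the idea, not the Automunge implementation; the helper name `replace_nan_keys` and its `placeholder` parameter are hypothetical.

```python
import math
import pickle


def replace_nan_keys(d, placeholder=None):
    """Return a copy of d with any float nan key replaced by placeholder."""
    out = {}
    for key, value in d.items():
        # np.nan is a plain Python float, so this check covers it as well.
        if isinstance(key, float) and math.isnan(key):
            key = placeholder
        out[key] = value
    return out


original = {float("nan"): 1234, "a": 1}
safe = replace_nan_keys(original)

# None is a true singleton, so the key still matches after a round trip.
restored = pickle.loads(pickle.dumps(safe))
print(restored[None])  # 1234
```

Note that if a dictionary holds several distinct nan keys, they all collapse onto the single placeholder key and only the last value is kept.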

Python 3.8.18 (default, Aug 25 2023, 04:23:37) 
[Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> import numpy as np
>>> n = dill.copy(np.nan)
>>> m = dill.copy(float('nan'))
>>> n is np.nan
False
>>> m is np.nan
False
>>> import copy
>>> o = copy.deepcopy(np.nan)
>>> o is np.nan
True

I'm going to reopen this issue, as I think dill.copy should produce np.nan itself and not a new nan, and this is something that can be corrected for within dill.
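
As one possible way to keep a nan singleton's identity across a round trip, the standard pickle module allows specific objects to be stored by reference through `persistent_id` and `persistent_load`. The sketch below is not dill's implementation or a proposed patch: it uses math.nan as a stand-in for np.nan, and the class names, the `roundtrip` helper, and the "nan-singleton" tag are all assumptions made for illustration.

```python
import io
import math
import pickle

# Stand-in for np.nan: the shared nan object whose identity should survive.
NAN_SINGLETON = math.nan


class NanPreservingPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Store the singleton as a tag instead of as a float value.
        if obj is NAN_SINGLETON:
            return "nan-singleton"
        return None  # everything else is pickled normally


class NanPreservingUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Map the tag back to the very same object.
        if pid == "nan-singleton":
            return NAN_SINGLETON
        raise pickle.UnpicklingError(f"unsupported persistent id: {pid!r}")


def roundtrip(obj):
    """Dump obj to an in-memory buffer and load it back."""
    buf = io.BytesIO()
    NanPreservingPickler(buf, protocol=pickle.HIGHEST_PROTOCOL).dump(obj)
    buf.seek(0)
    return NanPreservingUnpickler(buf).load()


restored = roundtrip({NAN_SINGLETON: 1234})
print(next(iter(restored)) is NAN_SINGLETON)  # True
print(restored[NAN_SINGLETON])                # 1234
```

Only the exact singleton object is tagged; a separately created float("nan") is still pickled by value and comes back as a new object.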