Extreme dates cannot be rendered when displaying DataFrames
xhochy opened this issue · comments
We can store dates in fletcher that pandas cannot store, because we allow precisions other than nanoseconds. Sadly, our code currently converts to nanoseconds when printing a DataFrame.
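To illustrate why the nanosecond conversion cannot succeed for such dates, here is a minimal sketch using only the standard library. It assumes timestamps are stored as signed 64-bit counts since 1970-01-01; the names `EPOCH` and `INT64_MAX` are introduced here for illustration only.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1

# Microseconds since the epoch for the date used in this report.
micros = (datetime(9999, 12, 1) - EPOCH) // timedelta(microseconds=1)

# Casting us -> ns multiplies by 1000, which no longer fits in an int64.
nanos = micros * 1000
overflows = nanos > INT64_MAX

# Latest instant a signed 64-bit nanosecond counter can hold,
# truncated to microseconds because datetime has no finer resolution.
latest_ns = EPOCH + timedelta(microseconds=INT64_MAX // 1000)

print(micros, overflows, latest_ns)
```

Any date after `latest_ns` (in the year 2262) fits in microsecond resolution but not in nanosecond resolution.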
Reproducible example:
import fletcher as fr
import pandas as pd
import datetime
df = pd.DataFrame({
"date": fr.FletcherContinuousArray([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])
})
print(df.head())
Exception:
Traceback (most recent call last):
File "extreme_dates.py", line 8, in <module>
print(df.head())
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 680, in __repr__
self.to_string(
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 820, in to_string
return formatter.to_string(buf=buf, encoding=encoding)
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 914, in to_string
return self.get_result(buf=buf, encoding=encoding)
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 521, in get_result
self.write_result(buf=f)
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 823, in write_result
strcols = self._to_str_columns()
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 759, in _to_str_columns
fmt_values = self._format_col(i)
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 948, in _format_col
return format_array(
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1172, in format_array
return fmt_obj.get_result()
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1203, in get_result
fmt_values = self._format_strings()
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1489, in _format_strings
array = np.asarray(values)
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
File "/Users/uwe/Development/fletcher/fletcher/base.py", line 328, in __array__
return self.data.to_pandas().values
File "pyarrow/array.pxi", line 567, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/array.pxi", line 1027, in pyarrow.lib.Array._to_pandas
File "pyarrow/array.pxi", line 1209, in pyarrow.lib._array_like_to_pandas
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000
That's the conversion to pandas failing in pyarrow (converting to scalar objects works):
In [23]: a = pa.array([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])
In [24]: a.to_pandas()
...
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000
In [28]: list(a)
Out[28]: [datetime.datetime(9999, 12, 1, 0, 0), datetime.datetime(9999, 12, 1, 0, 0)]
For dates, we have a date_as_object option in to_pandas; we should probably have something similar for timestamps.
BTW, the scalar conversion (e.g. converting to a list) does correctly use the datetime.datetime class when the resolution is not ns.
Now, that's for the pyarrow side. On the pandas side, I am surprised we go through np.asarray for formatting ExtensionArrays. Normally we should just pass the actual scalars from the ExtensionArray (iter(ea) should be sufficient for calling EA._formatter). That seems like a bug.
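A rough sketch of the per-scalar formatting idea, using only the standard library. `MicrosecondTimestamps` and `format_values` are hypothetical stand-ins invented for this illustration; they are not pandas or fletcher code. The class only mimics the two pieces of the ExtensionArray interface mentioned here: iteration over scalars and `_formatter`.

```python
from datetime import datetime, timedelta


class MicrosecondTimestamps:
    """Hypothetical stand-in for an extension array of us-resolution timestamps."""

    def __init__(self, micros):
        self._micros = list(micros)

    def __iter__(self):
        # Yield plain datetime scalars; no nanosecond coercion is involved.
        epoch = datetime(1970, 1, 1)
        for value in self._micros:
            yield epoch + timedelta(microseconds=value)

    def _formatter(self, boxed=False):
        # Mirrors the shape of ExtensionArray._formatter: returns a callable
        # that turns one scalar into a string.
        return str


def format_values(array):
    """Format each scalar individually instead of converting the whole array."""
    formatter = array._formatter(boxed=True)
    return [formatter(scalar) for scalar in array]


dates = MicrosecondTimestamps([253399622400000000, 0])
formatted = format_values(dates)
print(formatted)
```

Because only the scalars are touched, the year 9999 value is rendered without ever being cast to a nanosecond array.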
For the pyarrow/fletcher side, the issue is that in fletcher we use the hack of going through pd.Series (via self.data.to_pandas().values) to get a NumPy array. Thus we force the conversion to ns. Instead we should use to_numpy (which is sadly only available on pa.Array, not pa.ChunkedArray).
On the pandas side there is a hard conversion to NumPy arrays baked in here: https://github.com/pandas-dev/pandas/blob/01f73100d5f7b942a796ffd000962dee28b43f9c/pandas/io/formats/format.py#L1462
Passing in a correct numpy.array sadly also triggers the conversion to nanoseconds:
../pandas/pandas/io/formats/format.py:752: in _to_str_columns
fmt_values = self._format_col(i)
../pandas/pandas/io/formats/format.py:936: in _format_col
return format_array(
../pandas/pandas/io/formats/format.py:1159: in format_array
return fmt_obj.get_result()
../pandas/pandas/io/formats/format.py:1190: in get_result
fmt_values = self._format_strings()
../pandas/pandas/io/formats/format.py:1464: in _format_strings
fmt_values = format_array(
../pandas/pandas/io/formats/format.py:1159: in format_array
return fmt_obj.get_result()
../pandas/pandas/io/formats/format.py:1190: in get_result
fmt_values = self._format_strings()
../pandas/pandas/io/formats/format.py:1439: in _format_strings
values = DatetimeIndex(values)
../pandas/pandas/core/indexes/datetimes.py:249: in __new__
dtarr = DatetimeArray._from_sequence(
../pandas/pandas/core/arrays/datetimes.py:312: in _from_sequence
subarr, tz, inferred_freq = sequence_to_dt64ns(
../pandas/pandas/core/arrays/datetimes.py:1755: in sequence_to_dt64ns
data = conversion.ensure_datetime64ns(data)
That's for sure a bug then. We shouldn't try to coerce again if it's coming from an EA; we should simply call EA._formatter on the individual values.
Can you open an issue for pandas?
So just as a test, if you return an object dtype array, does it work then?
> So just as a test, if you return an object dtype array, does it work then?
Yes, then it works.
> Can you open an issue for pandas?
Yes, made a semi-open issue: pandas-dev/pandas#33319
Thanks!
The fletcher side is fixed by #119, while on the pandas side we need pandas-dev/pandas#33319.
This project has been archived, as development ceased around 2021.
With the support of Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.