xhochy / fletcher

Pandas ExtensionDType/Array backed by Apache Arrow

Home Page:https://fletcher.readthedocs.io/

Extreme dates cannot be rendered when displaying DataFrames

xhochy opened this issue

We can store dates in fletcher that pandas cannot store, because we allow precisions other than nanoseconds. Sadly, our code currently converts to nanoseconds when printing a DataFrame.

Reproducible example:

import fletcher as fr
import pandas as pd
import datetime

df = pd.DataFrame({
    "date": fr.FletcherContinuousArray([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])
})
print(df.head())

Exception:

Traceback (most recent call last):
  File "extreme_dates.py", line 8, in <module>
    print(df.head())
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 680, in __repr__
    self.to_string(
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 820, in to_string
    return formatter.to_string(buf=buf, encoding=encoding)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 914, in to_string
    return self.get_result(buf=buf, encoding=encoding)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 521, in get_result
    self.write_result(buf=f)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 823, in write_result
    strcols = self._to_str_columns()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 759, in _to_str_columns
    fmt_values = self._format_col(i)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 948, in _format_col
    return format_array(
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1172, in format_array
    return fmt_obj.get_result()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1203, in get_result
    fmt_values = self._format_strings()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1489, in _format_strings
    array = np.asarray(values)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/Users/uwe/Development/fletcher/fletcher/base.py", line 328, in __array__
    return self.data.to_pandas().values
  File "pyarrow/array.pxi", line 567, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/array.pxi", line 1027, in pyarrow.lib.Array._to_pandas
  File "pyarrow/array.pxi", line 1209, in pyarrow.lib._array_like_to_pandas
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000
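
The number in the error message can be checked by hand. The following is a small sketch using only the standard library. It assumes, as the message states, that the stored value is a count of microseconds since the Unix epoch and that the target is a signed 64-bit count of nanoseconds.

```python
import datetime

INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold

epoch = datetime.datetime(1970, 1, 1)
value = datetime.datetime(9999, 12, 1)

# Microseconds since the epoch: the number shown in the error message.
micros = (value - epoch) // datetime.timedelta(microseconds=1)

# Casting timestamp[us] to timestamp[ns] multiplies the count by 1000.
nanos = micros * 1000

fits_as_us = micros <= INT64_MAX
fits_as_ns = nanos <= INT64_MAX

# Latest datetime whose nanosecond count still fits in int64,
# truncated to whole microseconds.
latest_ns = epoch + datetime.timedelta(microseconds=INT64_MAX // 1000)

print(micros, fits_as_us, fits_as_ns, latest_ns.year)
```

The microsecond count fits comfortably, but the nanosecond count does not, which is why the cast is rejected.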

That's the conversion in pyarrow failing. Note that to_pandas() fails, while converting to scalar objects works:

In [23]: a = pa.array([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])   

In [24]: a.to_pandas()    
...
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000

In [28]: list(a) 
Out[28]: [datetime.datetime(9999, 12, 1, 0, 0), datetime.datetime(9999, 12, 1, 0, 0)]

For dates, to_pandas has a date_as_object option; we should probably have something similar for timestamps.
BTW, the scalar conversion (e.g. converting to a list) correctly uses the datetime.datetime class when the resolution is not ns.

Now, that's for the pyarrow side. On the pandas side, I am surprised we go through np.asarray for formatting ExtensionArrays. Normally we should just pass the actual scalars from the ExtensionArray (iter(ea) should be sufficient for calling EA._formatter). That seems like a bug.

On the pyarrow/fletcher side, the issue is that fletcher uses a hack: it goes through pd.Series (via self.data.to_pandas().values) to get a NumPy array, which forces the conversion to ns. Instead we should use to_numpy (which is sadly only available on pa.Array, not pa.ChunkedArray).

On the pandas side there is a hard conversion to NumPy arrays baked in here: https://github.com/pandas-dev/pandas/blob/01f73100d5f7b942a796ffd000962dee28b43f9c/pandas/io/formats/format.py#L1462

Passing in a correct numpy.array sadly also triggers the conversion to nanoseconds:

../pandas/pandas/io/formats/format.py:752: in _to_str_columns
    fmt_values = self._format_col(i)
../pandas/pandas/io/formats/format.py:936: in _format_col
    return format_array(
../pandas/pandas/io/formats/format.py:1159: in format_array
    return fmt_obj.get_result()
../pandas/pandas/io/formats/format.py:1190: in get_result
    fmt_values = self._format_strings()
../pandas/pandas/io/formats/format.py:1464: in _format_strings
    fmt_values = format_array(
../pandas/pandas/io/formats/format.py:1159: in format_array
    return fmt_obj.get_result()
../pandas/pandas/io/formats/format.py:1190: in get_result
    fmt_values = self._format_strings()
../pandas/pandas/io/formats/format.py:1439: in _format_strings
    values = DatetimeIndex(values)
../pandas/pandas/core/indexes/datetimes.py:249: in __new__
    dtarr = DatetimeArray._from_sequence(
../pandas/pandas/core/arrays/datetimes.py:312: in _from_sequence
    subarr, tz, inferred_freq = sequence_to_dt64ns(
../pandas/pandas/core/arrays/datetimes.py:1755: in sequence_to_dt64ns
    data = conversion.ensure_datetime64ns(data)

That's for sure a bug then. We shouldn't try to coerce again if it's coming from an EA, we should simply call EA._formatter on the individual values.
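
To illustrate the idea of formatting the scalars directly, here is a minimal sketch. ToyTimestampArray and format_values are hypothetical names made up for this example; the toy class only mimics the two pieces of the ExtensionArray interface that matter here (iteration and _formatter) and is not pandas' or fletcher's actual code.

```python
import datetime


class ToyTimestampArray:
    """Hypothetical stand-in for an ExtensionArray holding datetimes."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        return iter(self._values)

    def _formatter(self, boxed=False):
        # Same role as ExtensionArray._formatter in pandas: return a
        # callable that turns one scalar into its display string.
        return str


def format_values(array):
    """Format each scalar on its own, never calling np.asarray(array)."""
    formatter = array._formatter(boxed=True)
    return [formatter(value) for value in array]


arr = ToyTimestampArray([datetime.datetime(9999, 12, 1)] * 2)
formatted = format_values(arr)
print(formatted)
```

Since each value is formatted as a plain datetime.datetime, no conversion to a nanosecond dtype can occur.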
Can you open an issue for pandas?

So just as a test, if you return an object dtype array, does it work then?

So just as a test, if you return an object dtype array, does it work then?

Yes, then it works.
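
As a rough illustration of why the object dtype route avoids the problem (a sketch with plain NumPy, not the fletcher code itself): an object array stores the datetime.datetime instances as they are, so each one can be rendered without any resolution cast.

```python
import datetime

import numpy as np

extreme = datetime.datetime(9999, 12, 1)

# dtype=object keeps references to the Python datetime objects instead of
# converting them to a fixed-resolution datetime64 representation.
values = np.array([extreme, extreme], dtype=object)

rendered = [str(v) for v in values]
print(values.dtype, rendered)
```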

Can you open an issue for pandas?

Yes, made a semi-open issue: pandas-dev/pandas#33319

The fletcher side is fixed by #119, while the pandas side still needs pandas-dev/pandas#33319.

This project has been archived as development has ceased around 2021.
With the support of Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.