pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Home Page: https://pandas.pydata.org


BUG: .max() raises exception on Series with object dtype and mixture of Timestamp and NaT: TypeError: '>=' not supported between instances of 'Timestamp' and 'float'

kerrickstaley opened this issue

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df1 = pd.DataFrame({
    'a': [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')],
})  # column dtype: datetime64[ns, America/New_York]
df2 = pd.DataFrame({
    'a': [pd.NaT],
})  # column dtype: datetime64[ns]
df_concat = pd.concat([df1, df2])  # mixed dtypes, so the column becomes object
df_concat['a'].max()  # raises TypeError

Issue Description

The above code raises the exception:

TypeError: '>=' not supported between instances of 'Timestamp' and 'float'

This code is a simplified version of some prod code that caused issues at my work.

I believe what's happening is that df1's column has dtype datetime64[ns, America/New_York] and df2's has dtype datetime64[ns], and when you concat them, the resulting dtype is object. Then, .max() coerces pd.NaT to NaN, and you get a comparison between a Timestamp and a float.
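
For reference, checking the intermediate dtypes (continuing the reproducible example above) confirms this chain of coercions:

print(df1['a'].dtype)        # datetime64[ns, America/New_York]
print(df2['a'].dtype)        # datetime64[ns]
print(df_concat['a'].dtype)  # object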

Expected Behavior

I expect the code to return pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York').

I think Pandas should be robust and handle this case, even though the dtypes aren't perfectly "correct". I think the right place to fix this is in the .max() function: the code

max([pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York'), pd.NaT])

(where max is builtins.max) works fine, so you would also expect the Pandas equivalent to work.

You could maybe also make an argument that pd.concat should special-case this and return a column with dtype datetime64[ns, America/New_York], but I'm less sure about that.
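
In the meantime, the object coercion can be avoided on the caller's side. A minimal sketch, assuming both columns are meant to carry the America/New_York timezone: localize the all-NaT column before concatenating.

df2['a'] = df2['a'].dt.tz_localize('America/New_York')  # now datetime64[ns, America/New_York]
pd.concat([df1, df2])['a'].max()  # Timestamp('2024-05-13 12:00:00-0400', tz='America/New_York')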

Longer-term, I feel like Pandas should support datetime columns with heterogeneous timezones; the requirement that timezones be the same for a whole column feels like an artificial constraint, and many real-world datasets will naturally have heterogeneous timezones.

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2e
python : 3.12.3.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:19:22 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8112
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.1.1
pip : 24.0
Cython : None
pytest : 8.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Thanks for the report. As a workaround, df_concat['a'].convert_dtypes().max() works.
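
A minimal sketch of that workaround, continuing the reproducible example above (the re-inferred dtype is what lets .max() succeed):

fixed = df_concat['a'].convert_dtypes()  # re-infers the object column so NaT is treated as a missing datetime
fixed.max()  # returns the tz-aware Timestamp instead of raising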