unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera

Pandera timezone-agnostic datetime type

max-raphael opened this issue

Is your feature request related to a problem? Please describe.
When defining a class that inherits from DataFrameModel, I want to define a field whose values are datetimes. Moreover, those values will have timezones. However, I will not be able to specify in the class definition which timezone that will be. In other words, in dataframe A, they may be datetimes with tz="America/New_York". In dataframe B, they may be datetimes with tz="America/Los_Angeles". As far as I can tell, there is no type I can assign that accepts datetimes with timezones without naming a specific timezone in the type hint.

Describe the solution you'd like
I would like there to be a type that I can use to say "this field will be datetimes, but I can't say what the timezone will be."

Describe alternatives you've considered
When setting the type of the field to datetime.datetime, pandera.dtypes.DateTime, etc., I get a pandera SchemaError saying that the series was expected to have type datetime64[ns], but got datetime64[ns, America/New_York] (for example).

I have also tried with DatetimeTZDtype, but that won't work because I need to specify the timezone I want (which I can't do upfront).
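For illustration, the check being asked for can be sketched with pandas alone. This is only a sketch of the idea, not pandera's API: the helper name `has_any_timezone` is hypothetical, and it relies on the fact that every tz-aware pandas series has a dtype that is an instance of `pd.DatetimeTZDtype`, regardless of which timezone it carries.

```python
import pandas as pd


def has_any_timezone(series: pd.Series) -> bool:
    """Hypothetical helper: True if the series holds tz-aware datetimes,
    whatever the timezone is."""
    return isinstance(series.dtype, pd.DatetimeTZDtype)


# A naive datetime series and two tz-aware variants of it.
naive = pd.Series(pd.to_datetime(["2023-01-01 09:00", "2023-01-02 09:00"]))
new_york = naive.dt.tz_localize("America/New_York")
los_angeles = naive.dt.tz_localize("America/Los_Angeles")

results = {
    "naive": has_any_timezone(naive),
    "new_york": has_any_timezone(new_york),
    "los_angeles": has_any_timezone(los_angeles),
}
print(results)
```

Both tz-aware series pass the same check even though their dtypes differ, which is the behavior a timezone-agnostic field type would need.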

Additional context

Example Schema:

class MySchema(DataFrameModel):
    local_datetime: <what type do I set here?>

Hi @max-raphael, this is a somewhat challenging use case to fulfill with datetimes, because if we have a timezone-agnostic datetime, how do we deal with coercion?

Imagine we support something like:

class MySchema(DataFrameModel):
    local_datetime: DateTime(has_tz=True)  # just checks that the datetimes have any timezone

    class Config:
        coerce = True

If we do coerce=True, what timezone should we coerce to? Solutions here would be:

  • Default to UTC
  • Raise an exception

This is similar to the problem of having a generic Number type: this can check if the data type is any of the int or float types, but when we coerce, what data type should it default to?

I hear you, that does pose a tricky problem. Thinking about it from my perspective as a user, I think I would prefer to have this as an option but be disallowed from coercing this field (via some Exception) due to the ambiguous nature of the data type rather than not have it accessible to me at all.

Perhaps even an Exception is too much. Pandera could still allow users to specify coerce=True and coerce other fields, and add a warning level log statement that informs the user that this field cannot be coerced due to its data type.

How would you feel about defaulting to UTC on coercion (if the incoming raw data is not TZ-aware) and raising a warning that the dtypes are coerced to UTC? I generally like to do something rather than nothing on coercion to prevent propagation of surprise (i.e. a non-TZ aware dataframe after validation with coerce=True).

That seems acceptable to me. I think if incoming data is not tz-aware, then that's a reasonable approach so long as Pandera logs the warning and includes it in the documentation!
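The coercion behavior agreed on here could look roughly like the following pandas-only sketch. It is not pandera's implementation: the function name `coerce_to_tz_aware` and the warning text are assumptions made for illustration. Tz-aware input is left untouched, while naive input is localized to UTC with a warning.

```python
import warnings

import pandas as pd


def coerce_to_tz_aware(series: pd.Series) -> pd.Series:
    """Hypothetical coercion rule: keep any existing timezone,
    otherwise localize naive datetimes to UTC and warn."""
    if isinstance(series.dtype, pd.DatetimeTZDtype):
        # Already tz-aware: there is nothing ambiguous to coerce.
        return series
    warnings.warn(
        "Series is not timezone-aware; coercing to UTC.",
        UserWarning,
        stacklevel=2,
    )
    return series.dt.tz_localize("UTC")


naive = pd.Series(pd.to_datetime(["2023-01-01 09:00", "2023-01-02 09:00"]))
aware = naive.dt.tz_localize("America/New_York")

# Record warnings so the example shows which call triggered one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    coerced = coerce_to_tz_aware(naive)      # warns, becomes UTC
    untouched = coerce_to_tz_aware(aware)    # no warning, timezone kept

print(coerced.dtype, untouched.dtype, len(caught))
```

Localizing (rather than converting) treats the naive wall-clock values as if they were already in UTC, which is one reason a warning is useful.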

@cosmicBboy Hi, just following up here. Are we aligned on the feature? If so, what are the next steps? Thanks again for engaging with this, I think it would be helpful to many Pandera users.

Yep! Feel free to make a PR with changes to the DateTime type: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/pandas_engine.py#L792C3-L792C3 and add new unit tests in the appropriate test module.

Also check out the contributing guide if it's your first time contributing: https://pandera.readthedocs.io/en/stable/CONTRIBUTING.html