Validating datetime columns regardless of timezone
robertdj opened this issue · comments
I am using Pandera with the new Polars plugin, which is really exciting.
I am validating a schema where one of the columns is a Datetime. I don't care whether the Datetime has a timezone or not.
However, Pandera appears to be strict about whether or not there is a timezone. Is it possible to ignore the presence of a timezone?
I suppose this could be handled with a union of types, as in this issue: #1152
Although I fear that I would then have to provide all possible allowed time zones?
Hi @robertdj, can you provide a code sample of the code you're working with?
Looking at the polars docs, could you use "*" to match any timezone, including no timezone?
Thanks for your quick answer! I was actually using datetime from the standard library:
class MySchema(pa.DataFrameModel):
    timestamp: datetime
But good point with using Polars' Datetime. This seems to be equivalent:
class MySchema(pa.DataFrameModel):
    timestamp: pl.Datetime(time_zone=None)
Unfortunately, the docs say that time_zone="*" requires the column to have a valid time zone. But a union of time_zone=None and time_zone="*" probably does the trick.
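To make the aware/naive distinction concrete, here is a minimal sketch using only the standard library's datetime (not Polars or Pandera). The helper names is_aware and is_datetime_any_tz are made up for illustration; the awareness test follows the rule given in the Python datetime documentation.

```python
from datetime import datetime, timezone


def is_aware(dt: datetime) -> bool:
    """Return True when dt carries a usable UTC offset."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


def is_datetime_any_tz(value: object) -> bool:
    """Hypothetical 'don't care' check: accept naive and aware values alike."""
    return isinstance(value, datetime)


naive = datetime(2024, 1, 1, 12, 0)
aware = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

print(is_aware(naive), is_aware(aware))
print(is_datetime_any_tz(naive), is_datetime_any_tz(aware))
```

A strict check corresponds to comparing is_aware against an expected value, while the behaviour asked for in this issue corresponds to is_datetime_any_tz.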
I think another way to handle this would be to override the check method in pandera.engines.polars_engine.DateTime so that a plain pl.Datetime will pass validation for both time zone-aware and time zone-unaware columns.
Does this make sense, or is the Union solution less ambiguous?
I think your suggestion sounds like a much better default!
Have you had time to consider this @cosmicBboy ?
One last thought: would it be too cumbersome to import the pandera data type instead?
from pandera.engines.polars_engine import DateTime

class MySchema(pa.DataFrameModel):
    timestamp: DateTime(tz_agnostic=True)
My main concern with my prior suggestion is that there would be no way to validate datetime types that don't have timezones:
- pl.Datetime: implicitly no timezone
- pl.Datetime(time_zone=None): explicitly no timezone
- pl.Datetime(time_zone=): some specific timezone
- pl.Datetime(time_zone="*"): any timezone
If someone wanted to validate that a column is datetime and has no timezone, pl.Datetime or pl.Datetime(time_zone=None) would no longer provide that guarantee.
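The four cases above, plus an explicit opt-in flag, can be sketched as a small matching rule. This is a pure-Python illustration under assumptions, not Pandera's implementation: DatetimeSpec and matches are hypothetical names, and tz_agnostic only mirrors the parameter name proposed in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatetimeSpec:
    """Hypothetical stand-in for a schema-side datetime type."""
    time_zone: Optional[str] = None  # None, a specific zone name, or "*"
    tz_agnostic: bool = False        # explicit opt-in to ignore time zones


def matches(spec: DatetimeSpec, column_tz: Optional[str]) -> bool:
    """Decide whether a column with time zone column_tz satisfies spec."""
    if spec.tz_agnostic:
        # Opt-in: any datetime column passes, with or without a time zone.
        return True
    if spec.time_zone == "*":
        # Wildcard: a time zone must be present, but any zone is accepted.
        return column_tz is not None
    # Otherwise the zone must match exactly; None means "no time zone".
    return spec.time_zone == column_tz


print(matches(DatetimeSpec(), None))
print(matches(DatetimeSpec(), "UTC"))
print(matches(DatetimeSpec(tz_agnostic=True), "UTC"))
```

Keeping the agnostic behaviour behind a separate flag leaves the default spec usable as a "no time zone" guarantee, which is the concern raised above.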
I think that is a good point. I'm fine with importing DateTime from pandera.