unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera

Validating datetime columns regardless of timezone

robertdj opened this issue

I am using Pandera with the new Polars plugin, which is really exciting.

I am validating a schema in which one of the columns is a Datetime. I don't care whether the Datetime has a timezone or not.
However, Pandera appears to be strict about whether or not there is a timezone. Is it possible to ignore the presence of a timezone?

I suppose this could be handled with a union of types, as in this issue: #1152
However, I fear that I would then have to list every allowed time zone?

Hi @robertdj can you provide a code sample of the code you're working with?

Looking at the polars docs, could you use "*" to match any timezone, including no timezone?

Thanks for your quick answer! I was actually using datetime from the standard library:

class MySchema(pa.DataFrameModel):
    timestamp: datetime

But good point with using Polars' datetime. This seems to be equivalent:

class MySchema(pa.DataFrameModel):
    timestamp: pl.Datetime(time_zone=None)

Unfortunately, the docs say that time_zone="*" only matches columns that have a valid time zone. But a union of time_zone=None and time_zone="*" probably does the trick.

I think another way to handle this would be to override the check method in pandera.engines.polars_engine.DateTime so that a plain pl.Datetime passes the validation check for both time-zone-aware and time-zone-naive columns.

Does this make sense, or is the Union solution less ambiguous?

I think your suggestion sounds like a much better default!

Have you had time to consider this @cosmicBboy ?

One last thought: would it be too cumbersome to import the pandera data type instead?

from pandera.engines.polars_engine import DateTime

class MySchema(pa.DataFrameModel):
    timestamp: DateTime(tz_agnostic=True)

My main concern with my prior suggestion is that there would be no way to validate datetime types that don't have timezones:

  1. pl.Datetime: implicitly no timezone
  2. pl.Datetime(time_zone=None): explicitly no timezone
  3. pl.Datetime(time_zone=): some specific timezone
  4. pl.Datetime(time_zone="*"): any timezone

If someone wanted to validate that a column is datetime and has no timezone, pl.Datetime or pl.Datetime(time_zone=None) would no longer provide that guarantee.
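The four cases above, plus the proposed opt-in flag, can be sketched as a small pure-Python model. This is not Pandera's implementation: the function `datetime_matches` and the constant `ANY_TZ` are hypothetical names, and time zones are modeled as plain strings or `None`. It assumes, as noted earlier in the thread, that "*" matches only columns that actually have a time zone.

```python
from typing import Optional

ANY_TZ = "*"  # wildcard meaning "any time zone, but there must be one"


def datetime_matches(
    expected_tz: Optional[str],
    actual_tz: Optional[str],
    tz_agnostic: bool = False,
) -> bool:
    """Model of whether a column's time zone satisfies the schema's Datetime."""
    if tz_agnostic:
        # Opt-in flag: ignore the time zone entirely.
        return True
    if expected_tz == ANY_TZ:
        # Case 4: any time zone is fine, but a naive column is rejected.
        return actual_tz is not None
    # Cases 1-3: None must match None, a specific zone must match exactly.
    return expected_tz == actual_tz


# With strict defaults, "no timezone" stays checkable:
print(datetime_matches(None, "UTC"))                    # naive schema, aware column
print(datetime_matches(None, "UTC", tz_agnostic=True))  # flag relaxes the check
```

Keeping the relaxed behaviour behind an explicit flag preserves the guarantee that a plain datetime annotation means "no timezone", which is the concern raised above.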

I think that is a good point. I'm fine with importing DateTime from pandera.