unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page:https://www.union.ai/pandera

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Parametrized type annotations are broken for polars DataFrameModels

r-bar opened this issue · comments

Description

Pandera DataFrameModels do not support parameterized types for polars, while DataFrameSchemas do.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Example

Here is an example of a working DataFrameSchema and several variations of broken DataFrameModels.

from typing import Annotated

import pandera.polars as pa
import polars as pl

from pandera.typing import Series
from pandera.errors import SchemaInitError


df = pl.DataFrame({
    "id": [1, 2, 3],
    "lists": [["a"], ["a", "b"], ["a", "b", "c"]],
})


# works!
schema = pa.DataFrameSchema(
    columns={
        "id": pa.Column(int),
        "lists": pa.Column(list[str]),
    }
)
schema.validate(df)
print("DataFrameSchema validation passed")


class Lists(pa.DataFrameModel):
    """Most basic, expected form given the working schema above."""
    id: int
    lists: list[str]


try:
    Lists.validate(df)
except SchemaInitError as e:
    print("\nLists validation failed")
    print(e)
else:
    print("\nLists validation passed")


class ListsSeries(pa.DataFrameModel):
    """Using series as a wrapper around basic data types like the id column here
    will not work. Examples of this appear in the pandera documentation.
    https://pandera.readthedocs.io/en/latest/dataframe_models.html#dtype-aliases
    """
    id: Series[int]
    lists: Series[list[str]]


try:
    ListsSeries.validate(df)
except SchemaInitError as e:
    print("\nListsSeries validation failed")
    print(e)
else:
    print("\nListsSeries validation passed")


class AlternateListsSeries(pa.DataFrameModel):
    """Demonstrating using Series as a type wrapper around only lists to avoid
    the initialization error on id."""
    id: int
    lists: Series[list[str]]


try:
    AlternateListsSeries.validate(df)
except SchemaInitError as e:
    print("\nAlternateListsSeries validation failed")
    print(e)
else:
    print("\nAlternateListsSeries validation passed")


class ListsAnnotated(pa.DataFrameModel):
    """Parameterized form using Annotated as suggested at
    https://pandera.readthedocs.io/en/latest/polars.html#nested-types
    """
    id: int
    lists: Series[Annotated[list, str]]


try:
    ListsAnnotated.validate(df)
except TypeError as e:
    print("\nListsAnnotated validation failed")
    print(e)
else:
    print("\nListsAnnotated validation passed")


class ListsAnnotatedStr(pa.DataFrameModel):
    """Alternate parameterized form using Annotated as seen in the examples here:
    https://pandera.readthedocs.io/en/latest/dataframe_models.html#annotated
    """
    id: int
    lists: Series[Annotated[list, "str"]]

try:
    ListsAnnotatedStr.validate(df)
except TypeError as e:
    print("\nListsAnnotatedStr validation failed")
    print(e)
else:
    print("\nListsAnnotatedStr validation passed")

When run with the following python / library versions:

  • python==3.11.9
  • polars==0.20.19
  • pandera[polars]==0.19.0b1

the above script produces:

DataFrameSchema validation passed

Lists validation failed
Invalid annotation 'lists: list[str]'

ListsSeries validation failed
Invalid annotation 'id: pandera.typing.pandas.Series[int]'

AlternateListsSeries validation failed
Invalid annotation 'lists: pandera.typing.pandas.Series[list[str]]'

ListsAnnotated validation failed
Annotation 'Annotated' requires all positional arguments ['args', 'kwargs'].

ListsAnnotatedStr validation failed
Annotation 'Annotated' requires all positional arguments ['args', 'kwargs'].

Expected behavior

I would expect any column types that are valid to pass to DataFrameSchema's constructor to also be valid as annotations for DataFrameModel.

Desktop (please complete the following information):

  • OS: MacOS 14.4.1 (M2 Max)
  • Browser: Chrome
  • Version: 0.19.0b1

Thanks for reporting this @r-bar FYI Series[Type] annotations is currently not supported in the polars API, see #1588 and ongoing discussion here: #1594.

Looking into this, planning on supporting:

class Lists(pa.DataFrameModel):
    """Most basic, expected form given the working schema above."""
    id: int
    lists: list[str]