unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera

Can you use Pydantic Field Aliasing with Pandera / PydanticModel schema definitions?

mcmasty opened this issue

How to use Pydantic Field Alias with pandera

I am processing a CSV and I am trying to use Pandera to validate the data. The names in the CSV header row are not what I want the names in my model to be. I haven't figured out how to achieve field aliasing. Any suggestions?

Here is a snippet that reproduces the error I am getting.

import io
import pydantic
import pandas as pd
import pandera as pa

from pandera.engines.pandas_engine import PydanticModel


class AliasedRecord(pydantic.BaseModel):
    name: str = pydantic.Field(alias="Name")
    amt_in_local: float = pydantic.Field(alias="Amount in local currency")

class AliasDFSchema(pa.DataFrameModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(AliasedRecord)
        strict = True
        coerce = True  # this is required, otherwise a SchemaInitError is raised

# Direct Pydantic Model Validation
ar_m = AliasedRecord.model_validate({"Name":"Foo", "Amount in local currency": 1.32})
print(f"My Model is: {ar_m}")

# Now try validating a DataFrame
# Generate data similar to the source CSV
f = io.StringIO('Name,Amount in local currency\nfoo,1.32\nbar,3.34')
df1 = pd.read_csv(f)
validated_df = AliasDFSchema(df1)

Output

The successful Model:


My Model is: name='Foo' amt_in_local=1.32

The DataFrame / Pandera error ...

... bunch of stuff removed for brevity  

SchemaError: column 'Name' not in DataFrameSchema {}

The DataFrame df1 itself is created correctly.

Looks like PydanticModel doesn't interact well with strict=True. This works:

class AliasDFSchema(pa.DataFrameModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(AliasedRecord)
        coerce = True  # this is required, otherwise a SchemaInitError is raised

One potential fix would be to update the DataFrameSchema.__init__ method to special-case dtype = PydanticModel: pull the column names (or aliases, where one is defined) out of the pydantic model and build the column dictionary from them.
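
As a rough illustration of the "pull out the column names/aliases" step, here is a minimal sketch that assumes pydantic v2 (the snippet in the issue uses model_validate). The helper name columns_from_model is hypothetical and not part of pandera; it only shows how the expected column names could be read from a model, not how DataFrameSchema.__init__ would actually consume them.

```python
import pydantic


class AliasedRecord(pydantic.BaseModel):
    name: str = pydantic.Field(alias="Name")
    amt_in_local: float = pydantic.Field(alias="Amount in local currency")


def columns_from_model(model):
    """Hypothetical helper (not a pandera API): map each expected
    DataFrame column name to the annotated type of its pydantic field.

    The alias is used when one is defined; otherwise the field name.
    """
    return {
        (field.alias or field_name): field.annotation
        for field_name, field in model.model_fields.items()
    }


columns = columns_from_model(AliasedRecord)
print(columns)
```

The resulting keys match the CSV header names ("Name", "Amount in local currency"), which are the names that strict = True would have to compare against the DataFrame's columns.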

Turning this into a bug issue in case anyone wants to open a PR!

I would like to have a crack at this, please.