unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera

failure_case conversion failed : polars.exceptions.ComputeError - pandera(0.19.0b3) with polars

obiii opened this issue · comments

Describe the bug
We are trying a simple validation example using polars, and we can't work out where the problem originates. Validation raises polars.exceptions.ComputeError whenever a check fails and the data contains a null.

For example, in the code below the dummy data contains an extract_date column with a None. Validation runs fine if every case_id is an int-convertible string, but raises the exception if any case_id is not int-convertible.

Here is the code:

import pandera.polars as pa
import polars as pl
from datetime import date
import json

class CaseSchema(pa.DataFrameModel):
    case_id: int = pa.Field(nullable=False, unique=True, coerce=True)
    gdwh_portfolio_id: str = pa.Field(nullable=False, unique=True, coerce=True)
    extract_date: date = pa.Field(nullable=True, coerce=True)

    class Config:
        drop_invalid_rows = True

invalid_lf = pl.DataFrame({
    #"case_id": ["1", "2", "3"],
    "case_id": ["1", "2", "abc"],
    "gdwh_portfolio_id": ["d", "e", "f"],
    "extract_date": [date(2024,1,1), date(2024,1,2), None]
})

try:
    CaseSchema.validate(invalid_lf, lazy=True)
except pa.errors.SchemaErrors as e:
    print(json.dumps(e.message, indent=4))

It gives: conversion from `struct[3]` to `str` failed in column 'failure_case' for 1 out of 1 values: [{"abc","f",null}]
If you uncomment "case_id": ["1", "2", "3"] and comment out "case_id": ["1", "2", "abc"], it runs fine.

We are not sure why it fails when there are nulls. If there are no nulls in the data, it works fine.

The trace we get is:


> Traceback (most recent call last):
>   File "<frozen runpy>", line 198, in _run_module_as_main
>   File "<frozen runpy>", line 88, in _run_code
>   File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/erehoba-acc-payments-req/code/Users/ourrehman/dna-payments-and-accounts/data_validation/test.py", line 22, in <module>
>     CaseSchema.validate(invalid_lf, lazy=True)
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/api/dataframe/model.py", line 289, in validate
>     cls.to_schema().validate(
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/api/polars/container.py", line 58, in validate
>     output = self.get_backend(check_obj).validate(
>              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 65, in validate
>     check_obj = parser(check_obj, *args)
>                 ^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 398, in coerce_dtype
>     check_obj = self._coerce_dtype_helper(check_obj, schema)
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 486, in _coerce_dtype_helper
>     raise SchemaErrors(
>           ^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/errors.py", line 183, in __init__
>     ).failure_cases_metadata(schema.name, schema_errors)
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/base.py", line 173, in failure_cases_metadata
>     ).cast(
>       ^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/polars/dataframe/frame.py", line 6624, in cast
>     return self.lazy().cast(dtypes, strict=strict).collect(_eager=True)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1810, in collect
>     return wrap_df(ldf.collect())
>                    ^^^^^^^^^^^^^
> polars.exceptions.ComputeError: conversion from `struct[3]` to `str` failed in column 'failure_case' for 1 out of 1 values: [{"abc","f",null}]

Expected behavior

Validation should work for columns that contain nulls and are declared with nullable=True.

versions

pandera: 0.19.0b3
polars: 0.20.23
python: 3.11

I'm not a pandera user - but this is my understanding of why it is failing:

It seems the failure_case column can be a string or a struct.

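In the case of a struct, this cast in pandera's failure_cases_metadata (pandera/backends/polars/base.py in the trace above) fails:

).cast(
    {
        "failure_case": pl.Utf8,

The same error can be reproduced with polars alone:
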
import polars as pl

df = pl.DataFrame({
    'failure_case': [{'case_id': 'abc', 'extract_date': None}]
})

df.with_columns(pl.col("failure_case").cast(pl.String))
# ComputeError: conversion from `struct[2]` to `str` failed in column ...

A struct can be "stringified" in Polars via .struct.json_encode()

>>> df.with_columns(pl.col("failure_case").struct.json_encode())
shape: (1, 1)
┌───────────────────────────────────────┐
│ failure_case                          │
│ ---                                   │
│ str                                   │
╞═══════════════════════════════════════╡
│ {"case_id":"abc","extract_date":null} │
└───────────────────────────────────────┘

But I'm not sure if that's what pandera wants to do in this case.

Good catch! #1608 should address this

Hi @cosmicBboy

I was previously using 0.19.0b3, which I installed with:
pip install --pre 'pandera[polars]'

I don't see a new tag that includes your PR.
Can you please let me know how to install the updates you have made in this PR?

Thanks!