failure_case conversion failed : polars.exceptions.ComputeError - pandera(0.19.0b3) with polars
obiii opened this issue
Describe the bug
We are trying a simple validation example using polars. We can't understand the problem or where it originates, but it throws a polars.exceptions.ComputeError exception when any validation fails and there is a null in the data.
For example, in the code below, the dummy data contains an extract_date column with a None. It runs fine if all case_id values are strings convertible to int, but throws the exception if any case_id is not convertible to int.
Here is the code:
import pandera.polars as pa
import polars as pl
from datetime import date
import json

class CaseSchema(pa.DataFrameModel):
    case_id: int = pa.Field(nullable=False, unique=True, coerce=True)
    gdwh_portfolio_id: str = pa.Field(nullable=False, unique=True, coerce=True)
    extract_date: date = pa.Field(nullable=True, coerce=True)

    class Config:
        drop_invalid_rows = True

invalid_lf = pl.DataFrame({
    #"case_id": ["1", "2", "3"],
    "case_id": ["1", "2", "abc"],
    "gdwh_portfolio_id": ["d", "e", "f"],
    "extract_date": [date(2024,1,1), date(2024,1,2), None]
})

try:
    CaseSchema.validate(invalid_lf, lazy=True)
except pa.errors.SchemaErrors as e:
    print(json.dumps(e.message, indent=4))
It gives: conversion from `struct[3]` to `str` failed in column 'failure_case' for 1 out of 1 values: [{"abc","f",null}]
If you uncomment "case_id": ["1", "2", "3"] and comment out "case_id": ["1", "2", "abc"], it runs fine.
Not sure why it panics when there are nulls. If there are no nulls in the data it works fine.
The trace we get is:
> Traceback (most recent call last):
> File "<frozen runpy>", line 198, in _run_module_as_main
> File "<frozen runpy>", line 88, in _run_code
> File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/erehoba-acc-payments-req/code/Users/ourrehman/dna-payments-and-accounts/data_validation/test.py", line 22, in <module>
> CaseSchema.validate(invalid_lf, lazy=True)
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/api/dataframe/model.py", line 289, in validate
> cls.to_schema().validate(
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/api/polars/container.py", line 58, in validate
> output = self.get_backend(check_obj).validate(
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 65, in validate
> check_obj = parser(check_obj, *args)
> ^^^^^^^^^^^^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 398, in coerce_dtype
> check_obj = self._coerce_dtype_helper(check_obj, schema)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 486, in _coerce_dtype_helper
> raise SchemaErrors(
> ^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/errors.py", line 183, in __init__
> ).failure_cases_metadata(schema.name, schema_errors)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/base.py", line 173, in failure_cases_metadata
> ).cast(
> ^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/polars/dataframe/frame.py", line 6624, in cast
> return self.lazy().cast(dtypes, strict=strict).collect(_eager=True)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1810, in collect
> return wrap_df(ldf.collect())
> ^^^^^^^^^^^^^
> polars.exceptions.ComputeError: conversion from `struct[3]` to `str` failed in column 'failure_case' for 1 out of 1 values: [{"abc","f",null}]
Expected behavior
It should work with columns that contain nulls and are set nullable=True.
versions
pandera: 0.19.0b3
polars: 0.20.23
python: 3.11
I'm not a pandera user - but this is my understanding of why it is failing:
It seems the failure_case column can be a string or a struct.
In the case of a struct, this cast fails (pandera/pandera/backends/polars/base.py, lines 173 to 175 in dbf1831):
import polars as pl
df = pl.DataFrame({
    'failure_case': [{'case_id': 'abc', 'extract_date': None}]
})
df.with_columns(pl.col("failure_case").cast(pl.String))
# ComputeError: conversion from `struct[2]` to `str` failed in column ...
A struct can be "stringified" in Polars via .struct.json_encode():
>>> df.with_columns(pl.col("failure_case").struct.json_encode())
shape: (1, 1)
┌───────────────────────────────────────┐
│ failure_case │
│ --- │
│ str │
╞═══════════════════════════════════════╡
│ {"case_id":"abc","extract_date":null} │
└───────────────────────────────────────┘
But I'm not sure if that's what pandera wants to do in this case.
Good catch! #1608 should address this
Hi @cosmicBboy
I was previously using 0.19.0b3, which I installed using
pip install --pre 'pandera[polars]'
I don't see a new tag with your PR.
Can you please let me know how to install the updates you made in this PR?
Just cut a new beta release: https://github.com/unionai-oss/pandera/releases/tag/v0.19.0b4
Thanks!