Lazy schema validation does not raise expected errors with polars dataframes
philiporlando opened this issue
Describe the bug
I have a polars dataframe that should raise multiple schema validation errors. I want to see all of the errors at once, so I'm setting lazy=True when performing schema validation. Currently, none of the expected errors are raised. Additionally, an unexpected AttributeError is raised when switching from a LazyFrame to a DataFrame.
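For readers unfamiliar with the lazy=True behavior I'm relying on, here is a minimal pure-Python sketch of the idea: an eager validator stops at the first failing value, while a lazy one collects every failure before raising. The names `Failure`, `AggregatedErrors`, `validate`, and `collect_failures` are hypothetical and are not pandera's implementation; the data and checks mirror the example in this report, and nulls are skipped by the non-nullability-unrelated checks as an assumption.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class Failure(NamedTuple):
    column: str
    check: str
    failure_case: Optional[str]
    index: int


class AggregatedErrors(Exception):
    """Hypothetical stand-in for an aggregate error such as SchemaErrors."""

    def __init__(self, failure_cases: List[Failure]) -> None:
        super().__init__(f"{len(failure_cases)} failure case(s)")
        self.failure_cases = failure_cases


# (column name, check label, predicate that returns True when the value passes)
CheckSpec = Tuple[str, str, Callable[[Optional[str]], bool]]


def validate(
    data: Dict[str, List[Optional[str]]],
    checks: List[CheckSpec],
    lazy: bool = False,
) -> Dict[str, List[Optional[str]]]:
    failures: List[Failure] = []
    for column, name, passes in checks:
        for index, value in enumerate(data[column]):
            if passes(value):
                continue
            failure = Failure(column, name, value, index)
            if not lazy:
                # Eager mode: stop at the first failing value.
                raise AggregatedErrors([failure])
            failures.append(failure)
    if failures:
        # Lazy mode: report every failure at once.
        raise AggregatedErrors(failures)
    return data


ALLOWED = {"apple", "strawberry", "pear"}

data = {
    "foo": ["bar", "baz", "test", "tester"],
    "fruit": ["strawberry", "pear", "banana", "apple"],
    "fruit2": ["strawberry", "pear", "banana", None],
}

checks: List[CheckSpec] = [
    ("foo", "str_length(None, 4)", lambda v: v is None or len(v) <= 4),
    ("fruit", "isin", lambda v: v is None or v in ALLOWED),
    ("fruit2", "not_nullable", lambda v: v is not None),
    ("fruit2", "isin", lambda v: v is None or v in ALLOWED),
]


def collect_failures(lazy: bool) -> List[Failure]:
    """Run validation and return whatever failure cases were raised."""
    try:
        validate(data, checks, lazy=lazy)
    except AggregatedErrors as exc:
        return exc.failure_cases
    return []
```

With this sketch, `collect_failures(lazy=True)` yields four failure cases across all three columns, while `collect_failures(lazy=False)` yields only the first one. That full list is the kind of result I expect from the polars backend.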
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandera.
- (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample, a copy-pastable example
This code sample should raise schema validation errors on multiple columns, but no errors are raised and nothing is printed.
import pandera.polars as pa
from pandera.polars import Check, Column
import polars as pl

x = pl.LazyFrame(
    {
        "foo": ["bar", "baz", "test", "tester"],
        "fruit": ["strawberry", "pear", "banana", "apple"],
        "fruit2": ["strawberry", "pear", "banana", None],
    }
)

s = pa.DataFrameSchema(
    {
        "foo": Column(str, Check.str_length(max_value=4), required=True),
        "fruit": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
        "fruit2": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
    }
)

try:
    s.validate(x, lazy=True).collect()  # should raise errors on all three columns
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
# Nothing is returned...
When I switch from using a LazyFrame to a DataFrame, I see this error: AttributeError: 'NoneType' object has no attribute 'with_row_count'
import pandera.polars as pa
from pandera.polars import Check, Column
import polars as pl

x = pl.DataFrame(
    {
        "foo": ["bar", "baz", "test", "tester"],
        "fruit": ["strawberry", "pear", "banana", "apple"],
        "fruit2": ["strawberry", "pear", "banana", None],
    }
)

s = pa.DataFrameSchema(
    {
        "foo": Column(str, Check.str_length(max_value=4), required=True),
        "fruit": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
        "fruit2": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
    }
)

try:
    s.validate(x, lazy=True).collect()  # should raise errors on all three columns
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
# Traceback (most recent call last):
# File "C:\local\project\test_polars_validate_lazy_true.py", line 26, in <module>
# s.validate(x, lazy=True).collect() # should raise errors on all three columns
# ^^^^^^^^^^^^^^^^^^^^^^^^
# File "C:\local\project\.venv\Lib\site-packages\pandera\api\polars\container.py", line 58, in validate
# output = self.get_backend(check_obj).validate(
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# File "C:\local\project\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 92, in validate
# results = check(*args) # type: ignore[operator]
# ^^^^^^^^^^^^
# File "C:\local\project\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 182, in run_schema_component_checks
# result = schema_component.validate(check_obj, lazy=lazy)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# File "C:\local\project\.venv\Lib\site-packages\pandera\api\polars\components.py", line 143, in validate
# output = self.get_backend(check_obj).validate(
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# File "C:\local\project\.venv\Lib\site-packages\pandera\backends\polars\components.py", line 95, in validate
# raise SchemaErrors(
# ^^^^^^^^^^^^^
# File "C:\local\project\.venv\Lib\site-packages\pandera\errors.py", line 183, in __init__
# ).failure_cases_metadata(schema.name, schema_errors)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# File "C:\local\project\.venv\Lib\site-packages\pandera\backends\polars\base.py", line 151, in failure_cases_metadata
# index = err.check_output.with_row_count("index").filter(
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# AttributeError: 'NoneType' object has no attribute 'with_row_count'
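The last frame of the traceback calls a method on `err.check_output`, which is `None` for at least one collected error. The sketch below reproduces that failure pattern in plain Python and shows one possible guard. `CheckOutput`, `CollectedError`, `failing_rows`, and both helper functions are hypothetical names made up for illustration. They are not pandera internals, and this is not necessarily the change made in #1586.

```python
from dataclasses import dataclass
from typing import List, Optional


class CheckOutput:
    """Hypothetical per-row pass/fail result of a check."""

    def __init__(self, passed: List[bool]) -> None:
        self.passed = passed

    def failing_rows(self) -> List[int]:
        # Row positions where the check did not pass.
        return [i for i, ok in enumerate(self.passed) if not ok]


@dataclass
class CollectedError:
    """Hypothetical stand-in for one error gathered during lazy validation."""

    check: str
    check_output: Optional[CheckOutput]  # assumed to be None when no per-row result exists


def failing_rows_unguarded(err: CollectedError) -> List[int]:
    # Same shape as the failing line in the traceback: a method call on an
    # attribute that may be None.
    return err.check_output.failing_rows()


def failing_rows_guarded(err: CollectedError) -> List[int]:
    # One possible guard: report no row positions when there is no per-row output.
    if err.check_output is None:
        return []
    return err.check_output.failing_rows()


row_level = CollectedError("isin", CheckOutput([True, True, False, True]))
no_output = CollectedError("check_without_row_output", None)
```

Calling `failing_rows_unguarded(no_output)` raises an AttributeError on `NoneType`, the same class of error as in the traceback above, while `failing_rows_guarded` handles both inputs.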
Expected behavior
I would expect the output to match what the pandas dataframe returns:
import pandas as pd
import pandera as pa
from pandera import Check, Column

x = pd.DataFrame(
    {
        "foo": ["bar", "baz", "test", "tester"],
        "fruit": ["strawberry", "pear", "banana", "apple"],
        "fruit2": ["strawberry", "pear", "banana", None],
    }
)

s = pa.DataFrameSchema(
    {
        "foo": Column(str, Check.str_length(max_value=4), required=True),
        "fruit": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
        "fruit2": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
    }
)

try:
    s.validate(x, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
#    schema_context  column                                  check  check_number  failure_case  index
# 0          Column     foo                    str_length(None, 4)             0        tester      3
# 1          Column   fruit  isin(['apple', 'strawberry', 'pear'])             0        banana      2
# 2          Column  fruit2                           not_nullable          None          None      3
# 3          Column  fruit2  isin(['apple', 'strawberry', 'pear'])             0        banana      2
Desktop (please complete the following information):
- OS: Windows 10
- Browser: Chrome
- Version: pandera@git+https://github.com/unionai-oss/pandera.git@870d74a79546ee0ac019d3c347bd31643a66f1cc
Thanks for finding this one! See: #1586
Please keep these bug reports coming! It helps to iron these out before the stable 0.19.0 release.
Just pulled the changes within #1586 and can confirm that the expected output is returned! Thanks for addressing these so quickly!