Pandera Polars datatype 'check' method is not provided a 'data_container'
cstabnick-datasense opened this issue · comments
Describe the bug
When registering a custom datatype with the polars_engine, the check method can only validate the dtype. It cannot validate the values in the data_container, as is possible in the example here: https://pandera.readthedocs.io/en/stable/dtypes.html#logical-data-types.
I am curious if this is intentional, or if this validation would be a desired enhancement. If so, I would be happy to create a PR for this change. Thank you.
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandera.
- (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
from typing import Optional, Union, Iterable

import pandera.polars as pa
import polars as pl
from pandera import dtypes
from pandera.engines import polars_engine
from pandera.engines.polars_engine import PolarsDataContainer


@polars_engine.Engine.register_dtype
class MyLeadingIDStringType(polars_engine.Object):
    def check(self, pandera_dtype: dtypes.DataType, data_container: Optional[PolarsDataContainer] = None) -> Union[bool, Iterable[bool]]:
        if data_container is not None:
            # unreachable, data_container is always None
            print("data_container is not None")
            if key := data_container.key:
                if (
                    len(data_container.collect().select(pl.col(key).str.starts_with("id_").arg_true()))
                    == data_container.select(pl.col(key).len()).collect().item()
                ):
                    return True
                else:
                    return False
            else:
                raise NotImplementedError("Dataframe case unhandled")
        else:
            return True


class MySchema(pa.DataFrameModel):
    id: MyLeadingIDStringType


data = pl.DataFrame({"id": ["id_1", "id_2", "id_3", "id_4"]})
x = MySchema.validate(data.lazy())
print(x.collect())
Expected behavior
I am expecting my 'check' method to be provided a reference to the data_container when working with the polars_engine, similar to how the pandas implementation works.
Desktop (please complete the following information):
- OS: Windows 11
- Version: pandera==0.19.0b0
- Python==3.11.5
- polars==0.20.16
Additional context
This happens because the call to schema.dtype.check in the polars backend is not passed the check_obj here:
pandera/pandera/backends/polars/components.py
Line 325 in 612d25c
The pandas implementation, by contrast, passes the check_obj here:
pandera/pandera/backends/pandas/array.py
Lines 286 to 289 in 612d25c
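To make the difference concrete, here is a minimal, library-free sketch of the two call patterns. It is a toy model rather than pandera's actual code: LeadingIdString, validate_without_data and validate_with_data are hypothetical names, and a plain list stands in for the data container.

```python
from typing import List, Optional


class LeadingIdString:
    """Toy logical dtype (hypothetical, not pandera code): strings starting with 'id_'."""

    def check(self, observed_dtype: str, data_container: Optional[List[str]] = None) -> bool:
        # Dtype-level validation is always possible.
        if observed_dtype != "str":
            return False
        # Value-level validation is only possible when the backend passes the data.
        if data_container is None:
            return True
        return all(value.startswith("id_") for value in data_container)


def validate_without_data(dtype: LeadingIdString, values: List[str]) -> bool:
    # Models the reported behavior: only the observed dtype is passed to check.
    return dtype.check("str")


def validate_with_data(dtype: LeadingIdString, values: List[str]) -> bool:
    # Models the expected behavior: the data is passed as the second argument.
    return dtype.check("str", values)


bad_values = ["id_1", "oops_2"]
print(validate_without_data(LeadingIdString(), bad_values))  # True: the bad value goes unnoticed
print(validate_with_data(LeadingIdString(), bad_values))     # False: the bad value is caught
```

In the first call pattern the value-level branch of check can never run, which matches the "unreachable" comment in the code sample above.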
this is a bug! looking into it
Okay, so this should fix it:
from pandera.api.polars.types import PolarsData

...
CoreCheckResult(
    passed=schema.dtype.check(
        obj_dtype,
        PolarsData(check_obj_subset, schema.selector),
    ),
    ...
)
Would you be able to open up a PR with some unit tests?
Yes, I will be happy to do so later this evening. Thank you very much
@cosmicBboy I have created the pr for this change.
#1622
Please let me know if you would like any changes.
Thank you
Apologies if you clicked that link after it was deleted. I recreated the PR as #1623 to address the sign-off requirement.
fixed by #1623