unionai-oss / pandera

Describe the bug
When a custom datatype is registered with the polars_engine, its check method can only validate the dtype. It cannot validate the data itself through data_container, as the pandas example here does: https://pandera.readthedocs.io/en/stable/dtypes.html#logical-data-types

I am curious if this is intentional, or if this validation would be a desired enhancement. If so, I would be happy to create a PR for this change. Thank you.

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandera.
(optional) I have confirmed this bug exists on the main branch of pandera.


Code Sample, a copy-pastable example

from typing import Optional, Union, Iterable

import pandera.polars as pa
import polars as pl
from pandera import dtypes
from pandera.engines import polars_engine
from pandera.engines.polars_engine import PolarsDataContainer


@polars_engine.Engine.register_dtype
class MyLeadingIDStringType(polars_engine.Object):
    def check(self, pandera_dtype: dtypes.DataType, data_container: Optional[PolarsDataContainer] = None) -> Union[bool, Iterable[bool]]:
        if data_container is not None:
            # unreachable, data_container is always None
            print("data_container is not None")
            if key := data_container.key:
                # Count the rows that start with "id_" and compare against the total row count.
                n_matching = len(data_container.collect().select(pl.col(key).str.starts_with("id_").arg_true()))
                n_total = data_container.select(pl.col(key).len()).collect().item()
                return n_matching == n_total
            else:
                raise NotImplementedError("Dataframe case unhandled")
        else:
            return True


class MySchema(pa.DataFrameModel):
    id: MyLeadingIDStringType


data = pl.DataFrame({"id": ["id_1", "id_2", "id_3", "id_4"]})
x = MySchema.validate(data.lazy())
print(x.collect())

Expected behavior

I expect my 'check' method to receive a reference to the data_container when working with the polars_engine, as it does in the pandas implementation.

Desktop (please complete the following information):

OS: Windows 11
Version: pandera==0.19.0b0
Python==3.11.5
polars==0.20.16

Additional context

This happens because the call to schema.dtype.check is not passed check_obj here:

pandera/pandera/backends/polars/components.py

Line 325 in 612d25c

passed=schema.dtype.check(obj_dtype),

The pandas implementation, by contrast, passes check_obj here:

pandera/pandera/backends/pandas/array.py

Lines 286 to 289 in 612d25c

    
dtype_check_results = schema.dtype.check(
    Engine.dtype(check_obj.dtype),
    check_obj,
)
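
The difference between the two call sites can be reproduced without pandera. The sketch below uses hypothetical stand-ins (`RecordingDType`, `polars_style_call`, `pandas_style_call`), not pandera's actual classes. It only shows that a `check` method with an optional `data_container` silently falls back to `None` when the caller passes the dtype alone.

```python
from typing import Any, Optional


class RecordingDType:
    """Hypothetical stand-in for a custom dtype; not a pandera class."""

    def __init__(self) -> None:
        self.seen_container: Optional[Any] = None

    def check(self, pandera_dtype: Any, data_container: Optional[Any] = None) -> bool:
        # Remember what the caller supplied so the two call styles can be compared.
        self.seen_container = data_container
        return True


def polars_style_call(dtype: RecordingDType, obj_dtype: Any, check_obj: Any) -> bool:
    # Mirrors the reported polars call shape: only the dtype is passed.
    return dtype.check(obj_dtype)


def pandas_style_call(dtype: RecordingDType, obj_dtype: Any, check_obj: Any) -> bool:
    # Mirrors the pandas call shape: the data object is passed as well.
    return dtype.check(obj_dtype, check_obj)


data = ["id_1", "id_2"]

polars_dtype = RecordingDType()
polars_style_call(polars_dtype, "String", data)
print(polars_dtype.seen_container)  # None

pandas_dtype = RecordingDType()
pandas_style_call(pandas_dtype, "String", data)
print(pandas_dtype.seen_container)  # ['id_1', 'id_2']
```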

This is a bug! Looking into it.

Okay, so this should fix it:

from pandera.api.polars.types import PolarsData

...
CoreCheckResult(
    passed=schema.dtype.check(
        obj_dtype,
        PolarsData(check_obj_subset, schema.selector),
    ),
    ...
)
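
To illustrate what the extra argument enables, here is a minimal sketch with a stand-in container. `FakePolarsData`, its field names (`frame`, `key`), and `LeadingIDStringType` are assumptions made for this example, and a plain dict takes the place of a LazyFrame. Only the two-argument call shape comes from the snippet above.

```python
from typing import Dict, List, NamedTuple, Optional


class FakePolarsData(NamedTuple):
    """Hypothetical stand-in for the container passed to check()."""

    frame: Dict[str, List[str]]
    key: str


class LeadingIDStringType:
    """Hypothetical dtype requiring every value to start with 'id_'."""

    def check(self, pandera_dtype: str, data_container: Optional[FakePolarsData] = None) -> bool:
        if data_container is None:
            # Without data, only the dtype itself can be validated.
            return pandera_dtype == "String"
        values = data_container.frame[data_container.key]
        return all(value.startswith("id_") for value in values)


dtype = LeadingIDStringType()
good = FakePolarsData({"id": ["id_1", "id_2"]}, "id")
bad = FakePolarsData({"id": ["id_1", "x_2"]}, "id")

print(dtype.check("String", good))  # True
print(dtype.check("String", bad))   # False
print(dtype.check("String"))        # True: without the container, bad values go unseen
```

The last call shows why the missing argument matters: a dtype-only check passes even when the column contents would fail.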

Would you be able to open up a PR with some unit tests?

Yes, I will be happy to do so later this evening. Thank you very much.

@cosmicBboy I have created the PR for this change:
#1622
Please let me know if you would like any changes.
Thank you.

@cosmicBboy

Apologies if you clicked that while it was deleted. I recreated the PR as #1623 to address the sign-off requirement.

fixed by #1623

Pandera Polars datatype 'check' method is not provided a 'data_container'
