unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page:https://www.union.ai/pandera

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Pandera Polars datatype 'check' method is not provided a 'data_container'

cstabnick-datasense opened this issue · comments

Describe the bug
When registering a custom datatype in the polars_engine, the check function is only able to do dtype validation, not validation across the data_container like we are able to do in the example here: https://pandera.readthedocs.io/en/stable/dtypes.html#logical-data-types.

I am curious if this is intentional, or if this validation would be a desired enhancement. If so, I would be happy to create a PR for this change. Thank you.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

from typing import Optional, Union, Iterable

import pandera.polars as pa
import polars as pl
from pandera import dtypes
from pandera.engines import polars_engine
from pandera.engines.polars_engine import PolarsDataContainer


@polars_engine.Engine.register_dtype
class MyLeadingIDStringType(polars_engine.Object):
    def check(self, pandera_dtype: dtypes.DataType, data_container: Optional[PolarsDataContainer] = None) -> Union[bool, Iterable[bool]]:
        if data_container is not None:
            # unreachable, data_container is always None
            print("data_container is not None")
            if key := data_container.key:
                if (
                    len(data_container.collect().select(pl.col(key).str.starts_with("id_").arg_true()))
                    == data_container.select(pl.col(key).len()).collect().item()
                ):
                    return True
                else:
                    return False
            else:
                raise NotImplementedError("Dataframe case unhandled")
        else:
            return True


class MySchema(pa.DataFrameModel):
    id: MyLeadingIDStringType


data = pl.DataFrame({"id": ["id_1", "id_2", "id_3", "id_4"]})
x = MySchema.validate(data.lazy())
print(x.collect())

Expected behavior

I am expecting my 'check' method to be provided a reference to the data_container when working with the polars_engine, similar to how the pandas implementation works.

Desktop (please complete the following information):

  • OS: Windows 11
  • Browser:
  • Version: pandera==0.19.0b0
  • Python==3.11.5
  • polars==0.20.16

Screenshots

If applicable, add screenshots to help explain your problem.

Additional context

This is because the call to our schema.dtype.check is not provided the check_obj here:

passed=schema.dtype.check(obj_dtype),

Where the pandas implementation is providing this check_obj here
dtype_check_results = schema.dtype.check(
Engine.dtype(check_obj.dtype),
check_obj,
)

 

this is a bug! looking into it

Okay, so this should fix it:

from pandera.api.polars.types import PolarsData

...
              CoreCheckResult(
                    passed=schema.dtype.check(
                        obj_dtype,
                        PolarsData(check_obj_subset, schema.selector),
                    ),
           ...
             )

Would you be able to open up a PR with some unit tests?

Yes, I will be happy to do so later this evening. Thank you very much

@cosmicBboy I have created the pr for this change.
#1622
Please let me know if you would like any changes.
Thank you

@cosmicBboy

Apologies if you clicked that while it was deleted, recreated to address the signoff requirement #1623