unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera

Custom check erroneously passes when validating `pl.LazyFrame`

philiporlando opened this issue

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

I've created a custom check function that should never return True based on my sample data. However, pandera does not raise an error when validating the fruit column. This may be related to #1565.

import polars as pl
import pandera.polars as pa


# Custom check function
def check_len(v: str) -> bool:
    return len(v) == 20

schema = pa.DataFrameSchema(
    {
        "fruit": pa.Column(
            dtype=str,
            checks=pa.Check(check_len, element_wise=True),
        ),
    }
)

lf = pl.LazyFrame(
    {
        "fruit": ["apple", "pear", "banana"],
    }
)

lf.pipe(schema.validate).collect()
# shape: (3, 1)
# ┌────────┐
# │ fruit  │
# │ ---    │
# │ str    │
# ╞════════╡
# │ apple  │
# │ pear   │
# │ banana │
# └────────┘

Converting from LazyFrame to DataFrame before performing the schema validation appears to raise the expected error:

import polars as pl
import pandera.polars as pa


# Custom check function
def check_len(v: str) -> bool:
    return len(v) == 20

schema = pa.DataFrameSchema(
    {
        "fruit": pa.Column(
            dtype=str,
            checks=pa.Check(check_len, element_wise=True),
        ),
    }
)

lf = pl.LazyFrame(
    {
        "fruit": ["apple", "pear", "banana"],
    }
)

df = lf.collect()
df.pipe(schema.validate)
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:74: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   passed = check_result.check_passed.collect().item()
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:88: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   failure_cases = check_result.failure_cases.collect()
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:112: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   check_output=check_result.check_output.collect(),
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "C:\local\.venv\Lib\site-packages\polars\dataframe\frame.py", line 5150, in pipe
#     return function(self, *args, **kwargs)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\api\polars\container.py", line 58, in validate
#     output = self.get_backend(check_obj).validate(
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 114, in validate
#     error_handler.collect_error(
#   File "C:\local\.venv\Lib\site-packages\pandera\api\base\error_handler.py", line 54, in collect_error
#     raise schema_error from original_exc
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 182, in run_schema_component_checks
#     result = schema_component.validate(check_obj, lazy=lazy)
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\api\polars\components.py", line 141, in validate
#     output = self.get_backend(check_obj).validate(
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\components.py", line 81, in validate
#     error_handler = self.run_checks_and_handle_errors(
#                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\components.py", line 147, in run_checks_and_handle_errors
#     error_handler.collect_error(
#   File "C:\local\.venv\Lib\site-packages\pandera\api\base\error_handler.py", line 54, in collect_error
#     raise schema_error from original_exc
# pandera.errors.SchemaError: Column 'fruit' failed validator number 0: <Check check_len> failure case examples: [{'fruit': 'apple'}, {'fruit': 'pear'}, {'fruit': 'banana'}]

Expected behavior

I would expect to see a schema validation error raised with the LazyFrame here since none of the fruit values have a string length of 20 characters.

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: Chrome
  • Version: pandera==0.19.0b1

See the docs here https://pandera.readthedocs.io/en/latest/polars.html#error-reporting

This is intended behavior: LazyFrame validation only runs schema-level checks (so as not to materialize the data in a lazy method chain). Currently, pandera assumes that all custom checks operate on data. You can force data-level checks by explicitly setting export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA.
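For anyone who would rather set this from Python than from the shell, here is a minimal sketch. It assumes the variable is named `PANDERA_VALIDATION_DEPTH` (as in the linked docs) and that pandera reads it when it is first imported, so it has to be set before the import. The helper name `validate_lazy_fruit` is made up for illustration.

```python
import os

# Assumption: pandera reads PANDERA_VALIDATION_DEPTH when it is first imported,
# so the variable is set before any pandera import happens.
os.environ["PANDERA_VALIDATION_DEPTH"] = "SCHEMA_AND_DATA"


def validate_lazy_fruit():
    """Validate the LazyFrame from the report with data-level checks requested."""
    # Imported here so the environment variable above is already in place.
    import polars as pl
    import pandera.polars as pa

    # Same custom check as in the report: no sample value is 20 characters long.
    def check_len(v: str) -> bool:
        return len(v) == 20

    schema = pa.DataFrameSchema(
        {"fruit": pa.Column(dtype=str, checks=pa.Check(check_len, element_wise=True))}
    )
    lf = pl.LazyFrame({"fruit": ["apple", "pear", "banana"]})
    # With data-level checks enabled, this is expected to raise
    # pandera.errors.SchemaError rather than pass silently.
    return schema.validate(lf)
```

Calling `lf.collect()` before validating, as in the second snippet of the report, should have the same effect without touching the environment.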

Is this a duplicate of #1565?

This is super helpful and makes total sense. Thanks for the feedback.

I don't think so. The error that I'm experiencing in #1565 is specific to pl.DataFrame.

Gotcha, yeah looks like a bug, looking.

@philiporlando would it make sense to add some logging at validation time to explicitly say what types of checks are being run? If so, would it make sense as logging.info, debug or something else?

I'm in favor of this! At the very least, I think it would be helpful to communicate which data-level checks are ignored whenever a LazyFrame is validated instead of a DataFrame. It might even make sense to log a warning here?
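To make the suggestion concrete, here is a rough sketch of what such a warning could look like. This is not pandera's code: `CheckInfo`, `select_checks`, and the logger name are hypothetical and only illustrate logging the data-level checks that get skipped for a lazy frame.

```python
import logging
from dataclasses import dataclass

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("validation_sketch")


@dataclass
class CheckInfo:
    """Hypothetical stand-in for a check: its name and whether it needs data."""
    name: str
    needs_data: bool


def select_checks(checks, is_lazy):
    """Return the checks to run, warning about data-level checks that are skipped."""
    if not is_lazy:
        # Eager frames are already materialized, so every check can run.
        return list(checks)
    skipped = [c.name for c in checks if c.needs_data]
    if skipped:
        logger.warning(
            "LazyFrame validation: skipping data-level checks: %s",
            ", ".join(skipped),
        )
    # Only schema-level checks remain for lazy frames.
    return [c for c in checks if not c.needs_data]
```

With a `dtype` check and `check_len`, the lazy path would run only `dtype` and log that `check_len` was skipped, which is the information that was missing in the report above.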

Thank you for looking into it!