unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera


Polars LazyFrame validation only does schema checks, DataFrame validation does full validation

cosmicBboy opened this issue

Is your feature request related to a problem? Please describe.

This feature request is about polishing the polars API to clarify how pandera validates data. The problem is that, in the current beta of the polars integration, calling schema.validate on a LazyFrame internally calls collect() and then converts back with lazy() before returning the validation output.

import polars as pl
import pandera.polars as pa

schema = pa.DataFrameSchema({"a": pa.Column(int)})

df = (
    pl.LazyFrame({"a": [1.0, 2.0, 3.0]})
    .cast({"a": pl.Int64})
    .pipe(schema.validate) # this calls .collect() on the LazyFrame
                           # and calls .lazy() before returning
                           # the output
    .with_columns(b=pl.lit("a"))
    # do more lazy operations
    .collect()
)
print(df)
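
For reference, the pipe call above behaves roughly like the following (a simplified sketch, not pandera's actual internals):

lf = pl.LazyFrame({"a": [1, 2, 3]})
validated = schema.validate(lf.collect()).lazy()  # collect, validate eagerly, re-wrap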

Describe the solution you'd like

The above behavior is not transparent and is a potential footgun: the schema.validate call breaks the lazy method chain, so the operations before and after it cannot be optimized as a single end-to-end query.

To put control of when data is materialized into the user's hands, this issue proposes changing the behavior of schema.validate depending on whether it is given a LazyFrame or a DataFrame. Since pandera already differentiates between schema-level and data-level validations (see details here), we can do the following:

schema1 = pa.DataFrameSchema(...)
schema2 = pa.DataFrameSchema(...)

df = (
    pl.LazyFrame({"a": [1.0, 2.0, 3.0]})
    .pipe(schema1.validate)  # only performs schema-level validations
    # do more lazy operations
    .collect()
    .pipe(schema2.validate)  # performs both schema-level and data-level validations
)
print(df)

Here, schema-level validations are checks on metadata, e.g.:

  • Checking for column presence
  • Verifying column data types
  • Ensuring column ordering
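
For illustration, a schema that only exercises schema-level properties might look like this (a minimal sketch; the column names are hypothetical, and the ordered flag is assumed to carry over from the pandas API):

schema_level_only = pa.DataFrameSchema(
    {
        "a": pa.Column(pl.Int64),  # column presence + dtype
        "b": pa.Column(pl.Utf8),   # column presence + dtype
    },
    ordered=True,  # column ordering
)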

Data-level validations, as the name suggests, are checks that inspect actual data values, e.g.:

  • Checking that integer values of a column are positive numbers
  • Making sure that string values are drawn from the set {"Apple", "Orange", "Banana"}
  • Checking that floating-point values are probabilities between 0.0 and 1.0
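
Each of these bullets maps onto a built-in pandera Check (a minimal sketch; the column names are hypothetical):

data_level_schema = pa.DataFrameSchema(
    {
        # integer values must be positive
        "count": pa.Column(pl.Int64, pa.Check.gt(0)),
        # string values drawn from a fixed set
        "fruit": pa.Column(pl.Utf8, pa.Check.isin(["Apple", "Orange", "Banana"])),
        # floating-point probabilities between 0.0 and 1.0
        "prob": pa.Column(pl.Float64, pa.Check.in_range(0.0, 1.0)),
    }
)

Under the proposal, validating a LazyFrame against this schema would skip these three checks; validating a DataFrame would run them.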

Describe alternatives you've considered

Another idea would be to make the differing behavior more explicit by registering pandera-specific methods on the DataFrame and LazyFrame objects:

# LazyFrame behavior
pl.LazyFrame({"a": [1.0, 2.0, 3.0]}).pandera.validate(schema)  # schema-only validation, coerces if coerce=True
pl.LazyFrame({"a": [1.0, 2.0, 3.0]}).pandera.cast(schema)  # coerce datatypes only, even if coerce=False

# DataFrame behavior
pl.DataFrame({"a": [1.0, 2.0, 3.0]}).pandera.validate(schema)  # schema- and data-level validation, coerces if coerce=True
pl.DataFrame({"a": [1.0, 2.0, 3.0]}).pandera.cast(schema)  # coerce datatypes only, even if coerce=False

Note that this would be syntactic sugar on top of the proposed solution.
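
Polars already supports this kind of extension through its register_*_namespace API, so the sugar could be implemented along these lines (a minimal sketch; the PanderaLazyNamespace class, its methods, and the dtype extraction from the schema are all hypothetical):

import polars as pl
import pandera.polars as pa

@pl.api.register_lazyframe_namespace("pandera")
class PanderaLazyNamespace:
    def __init__(self, lf: pl.LazyFrame) -> None:
        self._lf = lf

    def validate(self, schema: pa.DataFrameSchema) -> pl.LazyFrame:
        # under the proposal, validating a LazyFrame runs schema-level
        # checks only, without collecting the query
        return schema.validate(self._lf)

    def cast(self, schema: pa.DataFrameSchema) -> pl.LazyFrame:
        # coerce the dtypes declared in the schema, even if coerce=False
        return self._lf.cast(
            {name: col.dtype.type for name, col in schema.columns.items()}
        )

A matching class registered with pl.api.register_dataframe_namespace("pandera") would cover the eager DataFrame behavior.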

this is a great idea