unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera


Polars LazyFrame validation only does schema checks, DataFrame validation does full validation

cosmicBboy opened this issue

Is your feature request related to a problem? Please describe.

This feature request is about polishing the polars API to clarify how pandera validates data. The problem is that, in the current beta of the polars integration, calling schema.validate on a LazyFrame internally calls collect() and then converts back with lazy() before returning the validation output.

import polars as pl
import pandera.polars as pa

schema = pa.DataFrameSchema({"a": pa.Column(int)})

df = (
    pl.LazyFrame({"a": [1.0, 2.0, 3.0]})
    .cast({"a": pl.Int64})
    .pipe(schema.validate) # this calls .collect() on the LazyFrame
                           # and calls .lazy() before returning
                           # the output
    .with_columns(b=pl.lit("a"))
    # do more lazy operations
    .collect()
)
print(df)
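
For reference, the pipe call above behaves roughly like the following (a simplified sketch, not pandera's actual internals):

lf = pl.LazyFrame({"a": [1, 2, 3]})
validated = schema.validate(lf.collect()).lazy()  # collect, validate eagerly, re-wrap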

Describe the solution you'd like

The above behavior is not transparent and is a potential footgun: the schema.validate call breaks the lazy method chain, so the operations before and after it cannot be optimized as a single end-to-end query.

To put control of when data is materialized into the user's hands, this issue proposes changing the behavior of schema.validate depending on whether it is given a LazyFrame or a DataFrame. Since pandera already differentiates between schema-level and data-level validations (see details here), we can do the following:

schema1 = pa.DataFrameSchema(...)
schema2 = pa.DataFrameSchema(...)

df = (
    pl.LazyFrame({"a": [1.0, 2.0, 3.0]})
    .pipe(schema1.validate)  # only performs schema-level validations
    # do more lazy operations
    .collect()
    .pipe(schema2.validate)  # performs both schema-level and data-level validations
)
print(df)

Here, schema-level validations are checks on metadata, e.g.:

  • Checking for column presence
  • Verifying column data types
  • Ensuring column ordering
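
For illustration, a schema that only exercises schema-level properties might look like this (a minimal sketch; the column names are hypothetical, and the ordered flag is assumed to carry over from the pandas API):

schema_level_only = pa.DataFrameSchema(
    {
        "a": pa.Column(pl.Int64),  # column presence + dtype
        "b": pa.Column(pl.Utf8),   # column presence + dtype
    },
    ordered=True,  # column ordering
)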

Data-level validations, as the name suggests, are checks that inspect actual data values, e.g.:

  • Checking that integer values of a column are positive numbers
  • Making sure that string values are drawn from the set {"Apple", "Orange", "Banana"}
  • Checking that floating-point values are probabilities between 0.0 and 1.0
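
Each of these bullets maps onto a built-in pandera Check (a minimal sketch; the column names are hypothetical):

data_level_schema = pa.DataFrameSchema(
    {
        # integer values must be positive
        "count": pa.Column(pl.Int64, pa.Check.gt(0)),
        # string values drawn from a fixed set
        "fruit": pa.Column(pl.Utf8, pa.Check.isin(["Apple", "Orange", "Banana"])),
        # floating-point probabilities between 0.0 and 1.0
        "prob": pa.Column(pl.Float64, pa.Check.in_range(0.0, 1.0)),
    }
)

Under the proposal, validating a LazyFrame against this schema would skip these three checks; validating a DataFrame would run them.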

Describe alternatives you've considered

Another idea would be to make the differing behavior more explicit by registering pandera-specific methods on the DataFrame and LazyFrame objects:

# LazyFrame behavior
pl.LazyFrame({"a": [1.0, 2.0, 3.0]}).pandera.validate(schema)  # schema-only validation, coerces if coerce=True
pl.LazyFrame({"a": [1.0, 2.0, 3.0]}).pandera.cast(schema)  # coerce datatypes only, even if coerce=False

# DataFrame behavior
pl.DataFrame({"a": [1.0, 2.0, 3.0]}).pandera.validate(schema)  # schema- and data-level validation, coerces if coerce=True
pl.DataFrame({"a": [1.0, 2.0, 3.0]}).pandera.cast(schema)  # coerce datatypes only, even if coerce=False

Note that this would be syntactic sugar on top of the proposed solution.
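
Polars already supports this kind of extension through its register_*_namespace API, so the sugar could be implemented along these lines (a minimal sketch; the PanderaLazyNamespace class, its methods, and the dtype extraction from the schema are all hypothetical):

import polars as pl
import pandera.polars as pa

@pl.api.register_lazyframe_namespace("pandera")
class PanderaLazyNamespace:
    def __init__(self, lf: pl.LazyFrame) -> None:
        self._lf = lf

    def validate(self, schema: pa.DataFrameSchema) -> pl.LazyFrame:
        # under the proposal, validating a LazyFrame runs schema-level
        # checks only, without collecting the query
        return schema.validate(self._lf)

    def cast(self, schema: pa.DataFrameSchema) -> pl.LazyFrame:
        # coerce the dtypes declared in the schema, even if coerce=False
        return self._lf.cast(
            {name: col.dtype.type for name, col in schema.columns.items()}
        )

A matching class registered with pl.api.register_dataframe_namespace("pandera") would cover the eager DataFrame behavior.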

this is a great idea