unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera


Add support for `PANDERA_VALIDATION_ENABLED` for pandas

noklam opened this issue

Is your feature request related to a problem? Please describe.
In 0.16.0, `PANDERA_VALIDATION_ENABLED` was added to disable runtime checks. I want the flag to apply to pandas DataFrames as well.

Describe the solution you'd like
The decorator style of validation is convenient, but there is no easy way to turn it off, and it introduces runtime cost. The feature already exists for PySpark, and I want it for pandas DataFrames as well.

Currently, only PySpark is respecting this configuration:

    if not CONFIG.validation_enabled:
        return
    error_handler = ErrorHandler(lazy)
    return self._validate(
        check_obj=check_obj,
        head=head,
        tail=tail,
        sample=sample,
        random_state=random_state,
        lazy=lazy,
        inplace=inplace,
        error_handler=error_handler,
    )

Potentially, the same early-return logic can be added for pandas:

    def validate(
        self,
        check_obj: pd.DataFrame,
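
As a rough illustration of the idea, here is a minimal, self-contained sketch of the early-return pattern. It does not use pandera itself: `ToyConfig`, `load_config`, and `ToySchema` are hypothetical stand-ins, the set of strings treated as "disabled" is an assumption, and returning the input unchanged (rather than the bare `return` in the PySpark snippet above) is also an assumption about what would suit pandas pipelines.

```python
import os
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a global config object such as CONFIG."""

    validation_enabled: bool = True


def load_config(environ=os.environ) -> ToyConfig:
    # Assumption: "0", "false", or "no" (case-insensitive) disables validation.
    raw = environ.get("PANDERA_VALIDATION_ENABLED", "True")
    enabled = raw.strip().lower() not in {"0", "false", "no"}
    return ToyConfig(validation_enabled=enabled)


class ToySchema:
    """Hypothetical schema requiring a fixed set of keys in every row."""

    def __init__(self, required_columns, config):
        self.required_columns = set(required_columns)
        self.config = config

    def validate(self, rows):
        # Early return: skip every check when validation is disabled.
        if not self.config.validation_enabled:
            return rows
        for row in rows:
            missing = self.required_columns - set(row)
            if missing:
                raise ValueError(f"missing columns: {sorted(missing)}")
        return rows
```

With the flag off, `validate` hands back whatever it receives, so the cost of the checks disappears but invalid data also passes silently.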

Additional context

I haven't made a PR to pandera before. If this direction is correct, I could try to make one, but I would like to get some feedback first. Please advise which tests are needed and where I should add them.

The PR description and approach are good! Basically we need to:

  1. Add the early return to the pandas API schema and schema components
  2. Add tests similar to the ones in the PySpark tests
  3. Update the docs, probably a new page dedicated to configuration (if you can write the content I can help with the structure and formatting)
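
For step 2, the tests could take roughly the following shape. This is only a sketch under stated assumptions: `validation_enabled` and `validate_positive` are hypothetical stand-ins rather than pandera functions, and the real tests would exercise the pandas schema classes instead. It shows one way to toggle the environment variable per test with `unittest.mock.patch.dict`, which restores `os.environ` when the block exits.

```python
import os
from unittest import mock

ENV_VAR = "PANDERA_VALIDATION_ENABLED"


def validation_enabled() -> bool:
    # Hypothetical helper: reads the flag at call time so tests can toggle it.
    return os.environ.get(ENV_VAR, "True").strip().lower() not in {"0", "false"}


def validate_positive(values):
    """Hypothetical check standing in for a schema's validate method."""
    if not validation_enabled():
        return values
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return values


def test_validation_disabled_skips_checks():
    # Invalid data passes through untouched when the flag is off.
    with mock.patch.dict(os.environ, {ENV_VAR: "False"}):
        data = [1, -1]
        assert validate_positive(data) is data


def test_validation_enabled_raises():
    # The same invalid data is rejected when the flag is on.
    with mock.patch.dict(os.environ, {ENV_VAR: "True"}):
        try:
            validate_positive([1, -1])
        except ValueError:
            return
        raise AssertionError("expected ValueError for invalid data")
```

Note that the helper reads the variable on every call. If the library caches the flag in a config object at import time, the tests would need to patch that object instead of the environment.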

Sounds good! I will try to finish it this week. If not, I will be back in mid-October.

I just had a quick look. Does pandera have something like Gitpod or GitHub Codespaces for development in a cloud development environment (CDE)? If not, I can also create a separate PR to add support for Gitpod and maybe add it to the contribution guide as an alternative to building locally.

Gitpod has an open source program: https://www.gitpod.io/discover/opensource

I think GitHub Codespaces should just work out of the box, though I'm not sure how it installs the virtual environment.