unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera

Improve strategies internals: accumulate check statistics instead of filtering

cosmicBboy opened this issue

Is your feature request related to a problem? Please describe.

Currently, pandera converts multiple checks into strategies by using filters in hypothesis. This is inefficient: it causes slowdowns and low-entropy samples (see #1579).
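To illustrate why filtering is costly, here is a rough model of the problem using only the standard library. It is an analogy, not hypothesis's actual engine: `sample_with_filter` and `sample_with_bounds` are hypothetical helpers, and the range of -1000 to 1000 is an arbitrary assumption. In hypothesis terms, the first corresponds to something like `st.integers().filter(...)` and the second to `st.integers(min_value=0, max_value=10)`.

```python
import random


def sample_with_filter(rng, predicate, low=-1000, high=1000, n=20):
    """Rejection sampling: draw from a wide range and discard failures."""
    accepted, attempts = [], 0
    while len(accepted) < n:
        attempts += 1
        value = rng.randint(low, high)
        if predicate(value):
            accepted.append(value)
    return accepted, attempts


def sample_with_bounds(rng, min_value, max_value, n=20):
    """Direct sampling: every draw already satisfies the constraints."""
    return [rng.randint(min_value, max_value) for _ in range(n)], n


rng = random.Random(0)
filtered, filter_attempts = sample_with_filter(rng, lambda x: 0 <= x <= 10)
bounded, bounded_attempts = sample_with_bounds(rng, 0, 10)

# The filtered sampler typically needs far more draws for the same 20 values.
print(filter_attempts, bounded_attempts)
```

Both samplers return 20 valid values, but only the bounded one does so without wasted draws, which is the motivation for passing constraints to the strategy up front.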

Describe the solution you'd like

We'd like some way of accumulating check statistics/constraints (the values users provide in checks, e.g. in Check.ge(0), 0 would be the check statistic) before defining the element strategy of a particular column in a dataframe. This would obviate the need to use filters.

This might be implemented as a class that maintains the state of all the check statistics and then builds the element strategy from them:

from hypothesis.strategies import SearchStrategy

import pandera as pa


class Strategy:
    def __init__(self):
        # accumulated check statistics, keyed by strategy argument name
        self.check_statistics = {}

    def add(self, check: pa.Check):
        # translate check statistics into args/kwargs to be fed into
        # hypothesis strategy
        self.check_statistics["arg"] = ...  # placeholder for <value>

    def element(self) -> SearchStrategy:
        # returns a search strategy for a single element in the
        # dataframe column
        ...
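As a more concrete sketch of the class-based idea, the example below merges numeric bounds into keyword arguments and only builds a hypothesis strategy at the end. Everything here is an assumption for illustration: `BoundsAccumulator` is a hypothetical name, checks are represented as plain `(name, value)` pairs rather than `pa.Check` objects, and only `ge`/`le` on integers are handled.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class BoundsAccumulator:
    """Hypothetical helper: merges check statistics into strategy kwargs."""

    kwargs: Dict[str, Any] = field(default_factory=dict)

    def add(self, check_name: str, value: int) -> "BoundsAccumulator":
        # Keep the tightest bound seen so far instead of stacking filters.
        if check_name == "ge":
            current = self.kwargs.get("min_value")
            self.kwargs["min_value"] = value if current is None else max(current, value)
        elif check_name == "le":
            current = self.kwargs.get("max_value")
            self.kwargs["max_value"] = value if current is None else min(current, value)
        else:
            raise ValueError(f"unsupported check: {check_name}")
        return self

    def element(self):
        # Imported lazily so the accumulation logic has no hard dependency.
        from hypothesis import strategies as st

        return st.integers(**self.kwargs)


acc = BoundsAccumulator().add("ge", 0).add("le", 10).add("ge", 3)
print(acc.kwargs)  # {'min_value': 3, 'max_value': 10}
```

Calling `acc.element()` would then produce a single bounded integer strategy with no filters attached. Checks that cannot be expressed as strategy arguments would presumably still need a filter fallback, which this sketch does not cover.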

Describe alternatives you've considered

An alternative approach would be some kind of functional API that accumulates the check constraints, ultimately producing a hypothesis SearchStrategy.
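One possible shape for such a functional API is sketched below, assuming each constraint is a function from a kwargs dict to a tightened kwargs dict. The names `ge`, `le`, `accumulate`, and `to_strategy` are hypothetical and are not part of pandera or hypothesis; only `st.floats(min_value=..., max_value=...)` is an existing hypothesis call.

```python
from functools import reduce
from typing import Any, Callable, Dict

Kwargs = Dict[str, Any]
Constraint = Callable[[Kwargs], Kwargs]


def ge(value: float) -> Constraint:
    # Returns a function that tightens the lower bound.
    def apply(kwargs: Kwargs) -> Kwargs:
        lower = kwargs.get("min_value", value)
        return {**kwargs, "min_value": max(lower, value)}

    return apply


def le(value: float) -> Constraint:
    # Returns a function that tightens the upper bound.
    def apply(kwargs: Kwargs) -> Kwargs:
        upper = kwargs.get("max_value", value)
        return {**kwargs, "max_value": min(upper, value)}

    return apply


def accumulate(*constraints: Constraint) -> Kwargs:
    # Fold every constraint over an initially empty kwargs dict.
    return reduce(lambda kwargs, constraint: constraint(kwargs), constraints, {})


def to_strategy(kwargs: Kwargs):
    # Imported lazily; builds one bounded strategy from the folded kwargs.
    from hypothesis import strategies as st

    return st.floats(**kwargs)


bounds = accumulate(ge(0), le(100), le(50))
print(bounds)  # {'min_value': 0, 'max_value': 50}
```

Because each constraint returns a new dict, the fold has no shared mutable state, which is the main difference from the class-based sketch.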

Additional context

It would also be good to design a friendlier user-facing API for defining custom strategies.