Improve strategies internals: accumulate check statistics instead of filtering
cosmicBboy opened this issue · comments
Is your feature request related to a problem? Please describe.
Currently, pandera converts multiple checks into strategies by chaining filters in hypothesis. This is inefficient: it causes slowdowns and low-entropy samples (see #1579).
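To make the cost concrete, here is a rough analogy using only the standard library. It is plain rejection sampling, not hypothesis's actual engine, and `draw_with_filters` plus the sampling range are made up for illustration. The lambdas stand in for Check.ge(0) and Check.le(100) applied as filters.

```python
import random


def draw_with_filters(predicates, n, low=-10_000, high=10_000, seed=0):
    """Rejection-sample n integers that pass every predicate; count raw draws."""
    rng = random.Random(seed)
    samples, draws = [], 0
    while len(samples) < n:
        draws += 1
        x = rng.randint(low, high)
        # a value is kept only if every stacked filter accepts it
        if all(p(x) for p in predicates):
            samples.append(x)
    return samples, draws


# Hypothetical stand-ins for Check.ge(0) and Check.le(100) used as filters.
samples, draws = draw_with_filters([lambda x: x >= 0, lambda x: x <= 100], n=5)
print(f"{draws} raw draws for {len(samples)} accepted values")

# Passing the bounds to the generator directly needs exactly one draw per value.
rng = random.Random(0)
direct = [rng.randint(0, 100) for _ in range(5)]
print(direct)
```

With these numbers only about 0.5% of raw draws survive both filters, whereas the bounded generator never rejects anything.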
Describe the solution you'd like
We'd like some way of accumulating check statistics/constraints (the values users provide in checks; e.g. in Check.ge(0), 0 would be the check statistic) before defining the element strategy of a particular column in a dataframe. This would obviate the need to use filters.
This might be implemented as a class that maintains the state of all the check statistics and then builds the element strategy from them:
    import pandera as pa
    from hypothesis.strategies import SearchStrategy

    class Strategy:
        def __init__(self):
            # maps strategy argument names to accumulated check statistics
            self.check_statistics = {}

        def add(self, check: pa.Check):
            # translate check statistics into args/kwargs to be fed into
            # the hypothesis strategy
            self.check_statistics["arg"] = ...  # <value> taken from the check

        def element(self) -> SearchStrategy:
            # returns a search strategy for a single element in the
            # dataframe column
            ...
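A slightly more concrete sketch of the same idea, for integer columns only. Everything here is an assumption for illustration: `BoundsAccumulator` is a hypothetical name, checks are passed as plain `(name, value)` pairs rather than real pa.Check objects, and only "ge" and "le" are translated. The only real hypothesis API used is `st.integers(min_value=..., max_value=...)`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class BoundsAccumulator:
    """Hypothetical accumulator: merges check statistics into strategy kwargs."""

    kwargs: Dict[str, Any] = field(default_factory=dict)

    def add(self, check_name: str, value: Any) -> "BoundsAccumulator":
        # Only two illustrative checks are handled; the names mirror
        # Check.ge / Check.le but no pandera objects are involved.
        if check_name == "ge":
            current = self.kwargs.get("min_value")
            # keep the tightest lower bound seen so far
            self.kwargs["min_value"] = value if current is None else max(current, value)
        elif check_name == "le":
            current = self.kwargs.get("max_value")
            # keep the tightest upper bound seen so far
            self.kwargs["max_value"] = value if current is None else min(current, value)
        else:
            raise NotImplementedError(f"no translation for check {check_name!r}")
        return self

    def element(self):
        # Imported lazily so the accumulation logic works without hypothesis.
        from hypothesis import strategies as st

        # One bounded strategy, no filters.
        return st.integers(**self.kwargs)


acc = BoundsAccumulator().add("ge", 0).add("le", 100).add("ge", 10)
print(acc.kwargs)  # {'min_value': 10, 'max_value': 100}
```

The point of the design is that overlapping checks collapse into a single set of bounds before any strategy is built, so `element()` never has to reject a generated value.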
Describe alternatives you've considered
An alternative approach would be some kind of functional API that accumulates the check constraints, ultimately producing a hypothesis SearchStrategy.
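One possible shape for such a functional API, as a rough sketch. The names `ge`, `le`, `accumulate` and `to_strategy` are hypothetical and not part of pandera; each constraint is a function from a kwargs dict to a new kwargs dict, and the final step hands the result to `st.floats`, which does accept `min_value` and `max_value`.

```python
from functools import reduce
from typing import Any, Callable, Dict

Constraints = Dict[str, Any]
Constraint = Callable[[Constraints], Constraints]


def ge(value) -> Constraint:
    """Hypothetical constraint: tighten the lower bound to at least `value`."""
    def apply(kwargs: Constraints) -> Constraints:
        lower = kwargs.get("min_value")
        return {**kwargs, "min_value": value if lower is None else max(lower, value)}
    return apply


def le(value) -> Constraint:
    """Hypothetical constraint: tighten the upper bound to at most `value`."""
    def apply(kwargs: Constraints) -> Constraints:
        upper = kwargs.get("max_value")
        return {**kwargs, "max_value": value if upper is None else min(upper, value)}
    return apply


def accumulate(*constraints: Constraint) -> Constraints:
    # Fold the constraints left to right, starting from no constraints at all.
    return reduce(lambda kwargs, c: c(kwargs), constraints, {})


def to_strategy(kwargs: Constraints):
    # Imported lazily so the pure accumulation step has no dependencies.
    from hypothesis import strategies as st

    return st.floats(**kwargs)


kwargs = accumulate(ge(0), le(1), le(0.5))
print(kwargs)  # {'min_value': 0, 'max_value': 0.5}
```

Because every constraint returns a new dict, there is no shared state to manage, and the accumulated kwargs can be inspected or tested before any strategy is created.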
Additional context
It would also be good to come up with a nicer user-facing API for defining custom strategies.