unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera

Pandas backend: `check_nullable` is inefficient when a column schema has `nullable=True`

smarie opened this issue

`check_nullable` in the pandas backend seems to compute the null-value mask by calling `isna()` even when the mask is not needed.

https://github.com/unionai-oss/pandera/blob/9c484a92cc6e63ba11652444e9e6df9e587d668e/pandera/backends/pandas/array.py#L196C9-L198

isna = check_obj.isna()
passed = schema.nullable or not isna.any()

Even when `schema.nullable=True`, `isna` has already been computed by the time the `or` is evaluated, so the short-circuit saves nothing. This can cause a performance issue on dataframes with millions of rows.
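A minimal sketch of the difference, not the actual pandera code: `CountingColumn` and `NullMask` are hypothetical stand-ins for a pandas Series and its boolean mask, used only to count how often `isna()` runs, and `check_eager` / `check_lazy` are illustrative names. The sketch assumes the mask is not needed elsewhere when `nullable=True`, which the two quoted lines alone do not confirm.

```python
class NullMask:
    """Hypothetical stand-in for the boolean mask returned by isna()."""

    def __init__(self, flags):
        self.flags = flags

    def any(self):
        return any(self.flags)


class CountingColumn:
    """Hypothetical stand-in for a pandas Series that counts isna() scans."""

    def __init__(self, values):
        self.values = list(values)
        self.isna_calls = 0

    def isna(self):
        # Mimics Series.isna(): one full pass over the values.
        self.isna_calls += 1
        return NullMask([v is None for v in self.values])


def check_eager(check_obj, nullable):
    # Mirrors the two lines quoted above: the mask is always built.
    isna = check_obj.isna()
    return nullable or not isna.any()


def check_lazy(check_obj, nullable):
    # `or` short-circuits, so isna() only runs when nullable is False.
    return nullable or not check_obj.isna().any()


eager_col = CountingColumn([1.0, None, 3.0])
lazy_col = CountingColumn([1.0, None, 3.0])
check_eager(eager_col, nullable=True)
check_lazy(lazy_col, nullable=True)
print(eager_col.isna_calls, lazy_col.isna_calls)  # prints "1 0"
```

Both functions return the same result for every input; only the number of scans differs when `nullable` is true.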

@smarie feel free to make a PR!