unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

Home Page: https://www.union.ai/pandera

Passing DataFrameSchema to function that check_output is decorating?

dwinski opened this issue

Is there a way to pass a DataFrameSchema as an argument to the function that check_output (or another check decorator) is decorating? In the examples in the documentation (https://pandera.readthedocs.io/en/stable/decorators.html), the decorated function is passed only the dataframe to be processed (and validated), and the DataFrameSchema is assigned in the global scope. Ideally, I'd like to load the DataFrameSchema from a yaml file inside the main function of the script and then call helper dataframe-processing functions that use the validation decorators. To do this, I'd need to pass the DataFrameSchema I loaded from yaml as an argument to those functions, and the decorators would somehow also need access to it. I'm thinking of something like the code below (if it were possible). I'm not sure whether I'm missing some obvious solution or whether I need a workaround (maybe define an inner function that gets decorated?) to use this workflow in my script.

import pandas as pd
from pandera import check_output
from pandera.io import from_yaml


# Desired usage (not valid as written): output_schema is only known at call
# time, so the name is not defined when the decorator is applied.
@check_output(output_schema)
def load_data(path_to_data, output_schema):
    df = pd.read_csv(path_to_data)
    return df


def main(config_file):
    """Main function for script. Note that config file is a yaml file with read/write paths for script."""
    # load DataFrameSchema from yaml for validating raw data
    raw_data_schema = from_yaml(config_file['input_schema'])

    # load raw data and pass validation schema to decorator
    raw_data_df = load_data(config_file['raw_data_path'], raw_data_schema)

    # ....do more data processing in further steps after raw data has been validated