Untraceable config number in nar_preprocess_adressa.py
Heng-xiu opened this issue · comments
Hi. Those hardcoded stats (mean, stddev) are used to apply z-normalization to the numeric features. They came from the first pre-processing step for the Adressa dataset, which uses PySpark:
nar_module/scripts/dataproc_preprocessing/nar_preprocessing_addressa_01_dataproc.ipynb
Ideally, the PySpark pipeline should write them to a file that nar_preprocess_adressa.py then loads, instead of them being hardcoded as they are now.
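A minimal sketch of what that could look like, in plain Python rather than the actual PySpark code. The function names, the JSON layout, and the feature name `active_time_secs` are all hypothetical and not taken from the repo. It also assumes population standard deviation; the notebook may well use the sample version, so the numbers would need to match whatever the pipeline really computes.

```python
import json
import math
from pathlib import Path


def compute_stats(values):
    """Return mean and population stddev for a list of numbers (assumed definition)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "stddev": math.sqrt(variance)}


def save_stats(stats_by_feature, path):
    """Step 1 (pipeline side): persist the stats instead of copying them by hand."""
    Path(path).write_text(json.dumps(stats_by_feature, indent=2))


def load_stats(path):
    """Step 2 (preprocessing side): read the stats back from the file."""
    return json.loads(Path(path).read_text())


def z_normalize(value, stats):
    """Apply z-normalization using the loaded mean and stddev."""
    return (value - stats["mean"]) / stats["stddev"]


if __name__ == "__main__":
    # Hypothetical feature name and toy values, for illustration only.
    stats = {"active_time_secs": compute_stats([2, 4, 4, 4, 5, 5, 7, 9])}
    save_stats(stats, "numeric_feature_stats.json")

    loaded = load_stats("numeric_feature_stats.json")
    print(z_normalize(9, loaded["active_time_secs"]))
```

With something like this, the numbers in the second script would always be traceable to the file produced by the first step.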