Untraceable config number in nar_preprocess_adressa.py

Question

Heng-xiu opened this issue 4 years ago

Hi there,

Can anyone help me find out where these numbers come from?
For example, the config for '_elapsed_ms_since_last_click' is untraceable. I cannot figure out why we should fill in 1371436 for stddev or 789935.7 for avg.

Gabriel Moreira · Answer 1 · Mon Jul 06 2020 03:12:28 GMT+0800 (China Standard Time)

Hi. Those hardcoded stats (mean, stddev) are used to apply z-normalization to those numeric features. They came from the first pre-processing step for the Adressa dataset, which uses PySpark:
nar_module/scripts/dataproc_preprocessing/nar_preprocessing_addressa_01_dataproc.ipynb
Ideally, the PySpark pipeline should output them to a file that nar_preprocess_adressa.py then reads, instead of them being hardcoded as they are now.
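
To illustrate the idea, here is a minimal sketch of z-normalization with stats that are written to a file and read back, rather than hardcoded. It is not code from the repository: the function names, the JSON layout, and the file name feature_stats.json are all hypothetical. It uses plain Python with toy values instead of PySpark, and it assumes a population standard deviation, which may differ from what the notebook computes.

```python
import json
import os
import statistics
import tempfile


def compute_stats(values):
    """Return the mean and population standard deviation of a numeric feature."""
    return {"avg": statistics.fmean(values), "stddev": statistics.pstdev(values)}


def save_stats(stats_by_feature, path):
    """Write per-feature stats to a JSON file (hypothetical layout)."""
    with open(path, "w") as f:
        json.dump(stats_by_feature, f, indent=2)


def load_stats(path):
    """Read per-feature stats back from the JSON file."""
    with open(path) as f:
        return json.load(f)


def z_normalize(value, stats):
    """Apply z-normalization: (value - mean) / stddev."""
    return (value - stats["avg"]) / stats["stddev"]


# Step 1 (would live in the first pre-processing step): compute and save stats.
# These are toy values, not Adressa data.
elapsed_ms = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
stats_path = os.path.join(tempfile.mkdtemp(), "feature_stats.json")
save_stats({"_elapsed_ms_since_last_click": compute_stats(elapsed_ms)}, stats_path)

# Step 2 (would live in the second pre-processing script): load and reuse them.
loaded = load_stats(stats_path)
feature_stats = loaded["_elapsed_ms_since_last_click"]
normalized = [z_normalize(v, feature_stats) for v in elapsed_ms]

# The same formula applied with the hardcoded numbers quoted in the question:
hardcoded = {"avg": 789935.7, "stddev": 1371436.0}
z_at_mean = z_normalize(789935.7, hardcoded)  # a value equal to the mean maps to 0
```

With this layout, changing the dataset only requires re-running the first step; the second script picks up the new stats from the file and needs no edits.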