gabrielspmoreira / chameleon_recsys

Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems

Untraceable config number in nar_preprocess_adressa.py

Heng-xiu opened this issue

Hi there,

Can anyone help me find out where these numbers come from?
For example, the config for '_elapsed_ms_since_last_click' is untraceable: I cannot figure out why stddev is set to 1371436 and avg to 789935.7.

Hi. Those hardcoded stats (mean, stddev) are used to apply z-normalization to those numeric features. They come from the first pre-processing step for the Adressa dataset, which uses PySpark:
nar_module/scripts/dataproc_preprocessing/nar_preprocessing_addressa_01_dataproc.ipynb
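
For readers unfamiliar with z-normalization, here is a minimal sketch of what those two numbers do. It uses only the avg and stddev values quoted in the question; the function and constant names are hypothetical and are not taken from nar_preprocess_adressa.py.

```python
# Stats quoted in the question for '_elapsed_ms_since_last_click'.
# The constant names below are hypothetical, chosen for illustration only.
ELAPSED_MS_AVG = 789935.7
ELAPSED_MS_STDDEV = 1371436.0


def z_normalize(value, avg, stddev):
    """Center a raw value on the mean and scale it by the standard deviation."""
    return (value - avg) / stddev


# A raw value equal to the mean maps to 0; one stddev above the mean maps to about 1.
at_mean = z_normalize(ELAPSED_MS_AVG, ELAPSED_MS_AVG, ELAPSED_MS_STDDEV)
one_std_above = z_normalize(
    ELAPSED_MS_AVG + ELAPSED_MS_STDDEV, ELAPSED_MS_AVG, ELAPSED_MS_STDDEV
)
```

Because the stats describe one particular dataset, they would need to be recomputed if the same script were applied to different data.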
Ideally, the PySpark pipeline would write them to a file that nar_preprocess_adressa.py then reads, instead of them being hardcoded as they are now.
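
One way that could look is sketched below, assuming a JSON file keyed by feature name. This is not code from the repository: the helper names and the file layout are hypothetical, and plain Python with the `statistics` module stands in for the PySpark aggregation so that the example runs on its own. It uses the sample standard deviation, which may or may not match what the notebook computes.

```python
import json
import statistics


def compute_feature_stats(values):
    """Compute the avg and (sample) stddev of a numeric feature."""
    return {
        "avg": statistics.mean(values),
        "stddev": statistics.stdev(values),
    }


def save_feature_stats(stats_by_feature, path):
    """Write a {feature_name: {"avg": ..., "stddev": ...}} mapping as JSON."""
    with open(path, "w") as f:
        json.dump(stats_by_feature, f, indent=2)


def load_feature_stats(path):
    """Read back the mapping written by save_feature_stats."""
    with open(path) as f:
        return json.load(f)
```

With this layout, the first pre-processing step would call `save_feature_stats` once, and the second step would call `load_feature_stats` and look up each feature's avg and stddev by name, so the numbers stay traceable to the data they were computed from.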