ThyrixYang / es_dfm

Implementation and experimental comparison of ES-DFM (Yang et al. 2021), the Delayed Feedback Model (DFM, Chapelle 2014), Feedback Shift Importance Weighting (FSIW, Yasui et al. 2020), Fake Negative Weighted (FNW, Ktena et al. 2019) and Fake Negative Calibration (FNC, Ktena et al. 2019).


Advice for tuning the hyper-parameters of the initial model

hwlza opened this issue

commented

Hi, Jia-Qi

I tried to reproduce your experiments on the Criteo dataset, mainly comparing your proposed ES-DFM approach against the Pretrain baseline. However, I found a gap between my results and the values you reported in Table 2. Do you have any advice for tuning the hyper-parameters of the initial model?

I attach my running scripts & stats here for reference.
(1) First, I pre-trained a standard MLP model for CTR prediction by executing the following command:

python ./src/main.py --method Pretrain \
                    --mode pretrain \
                    --model_ckpt_path ./work_dir/ckpts/pretrain/baseline/baseline \
                    --data_path ./data/criteo/data.txt \
                    --data_cache ./work_dir/data_cache 

(2) Then, I trained the importance weighting classifiers as follows:

python ./src/main.py --method ES-DFM \
                    --mode pretrain \
                    --model_ckpt_path ./work_dir/ckpts/pretrain/esdfm/esdfm \
                    --data_path ./data/criteo/data.txt \
                    --data_cache ./work_dir/data_cache 

(3) Finally, I trained the ES-DFM model on the simulated streaming data as follows:

python ./src/main.py --method ES-DFM \
                    --mode stream \
                    --pretrain_baseline_model_ckpt_path ./work_dir/ckpts/pretrain/baseline/baseline  \
                    --pretrain_esdfm_model_ckpt_path ./work_dir/ckpts/pretrain/esdfm/esdfm \
                    --data_path ./data/criteo/data.txt \
                    --data_cache ./work_dir/data_cache

After step 3, the metrics I obtained were AUC = 0.838215320158801, PR-AUC = 0.637005309537314 and NLL = 0.394365247095086, which differ somewhat from the values you reported in Table 2 of the paper. Since the seed that controls the weight initialization affects the final results a bit, I reran steps 1 & 2 above with another seed and ran all 4 checkpoint combinations in step 3 (roughly as sketched below), but the metric differences still exist.
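For reference, the 4 combinations in step 3 were run roughly like the sketch below; the seed-suffixed checkpoint directory names (baseline_s1, baseline_s2, esdfm_s1, esdfm_s2) are only illustrative stand-ins for the outputs of the two pretraining runs:

# Illustrative sketch: loop over the 2 baseline x 2 ES-DFM pretrained checkpoints.
# The seed-suffixed directory names below are hypothetical; adjust to your own paths.
for b in baseline_s1 baseline_s2; do
  for e in esdfm_s1 esdfm_s2; do
    python ./src/main.py --method ES-DFM \
                        --mode stream \
                        --pretrain_baseline_model_ckpt_path ./work_dir/ckpts/pretrain/$b/$b \
                        --pretrain_esdfm_model_ckpt_path ./work_dir/ckpts/pretrain/$e/$e \
                        --data_path ./data/criteo/data.txt \
                        --data_cache ./work_dir/data_cache
  done
done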


Thank you.

commented

Hi @hwlza,

We did not fix the random seed when training the initial model, so I think this difference is mainly caused by randomness.
Nevertheless, such a difference will not affect the comparison between delayed feedback methods, since their performance shifts higher or lower together; the conclusions are not affected.
If necessary, you can use the pretrained checkpoints to reproduce the performance reported in the paper.
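For example, if you unpack the pretrained checkpoints under a directory such as ./pretrained_models/ (this path is only illustrative; point the flags at wherever you store them), step 3 becomes:

python ./src/main.py --method ES-DFM \
                    --mode stream \
                    --pretrain_baseline_model_ckpt_path ./pretrained_models/baseline/baseline \
                    --pretrain_esdfm_model_ckpt_path ./pretrained_models/esdfm/esdfm \
                    --data_path ./data/criteo/data.txt \
                    --data_cache ./work_dir/data_cache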

commented

My results under different seeds vary by less than 2e-4, but the difference caused by different initial models is about 2e-3. Any chance you could share the hyper-parameter settings of your initial model?

commented


I think you can try tuning the learning rate, since it's the only tunable hyper-parameter; a rough sweep like the one sketched below should be enough. I'm not quite sure about the exact value we used three years ago.
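A minimal sketch of such a sweep for the pretraining step is below; note that --lr is only a guess at the argument name, so please check src/main.py for the actual learning-rate flag and its default:

# Hypothetical learning-rate sweep for the baseline pretraining step.
# "--lr" is a placeholder flag name; check src/main.py for the real argument.
for lr in 1e-4 5e-4 1e-3 5e-3; do
  python ./src/main.py --method Pretrain \
                      --mode pretrain \
                      --lr $lr \
                      --model_ckpt_path ./work_dir/ckpts/pretrain/baseline_lr_$lr/baseline \
                      --data_path ./data/criteo/data.txt \
                      --data_cache ./work_dir/data_cache
done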

commented

Okay, will try it. Thx