iwii0425 / ftrl

Follow-the-regularized-leader (FTRL) implemented in Java, with an example using the Criteo dataset.

A Java Implementation of FTRL and An Example with Criteo Dataset
==================================================================


Get The Dataset
=================
Refer to the 'Get The Dataset' part of https://github.com/guestwalk/kaggle-2014-criteo/blob/master/README.


Run main_script.sh
===================
1. transform data
code in main_script---------
./feature_engineer/count.py train.csv > fc.trva.t10.txt
./feature_engineer/parallel_td.py -s 15 ./feature_engineer/transform_data.py train.csv train_trans.csv
----------------------------
numeric features (I1-I13) with a value v greater than 2 are replaced by int(log(v)^2)
categorical features (C1-C26) whose value appears fewer than 10 times are replaced by a special value
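
The two rules above can be sketched in Python as follows. This is a minimal illustration, not the code in transform_data.py: it reads log(v)^2 as the square of the natural logarithm, and the names RARE_VALUE, transform_numeric and transform_categorical, as well as leaving empty values untouched, are assumptions made for the example.

```python
import math
from collections import Counter

MIN_COUNT = 10
RARE_VALUE = "rare"  # hypothetical placeholder; the actual special value is not stated


def transform_numeric(raw):
    """Replace v by int(log(v)^2) when v > 2; leave other values unchanged."""
    if raw == "":
        return raw  # assumption: missing values are passed through as-is
    v = int(raw)
    if v > 2:
        return str(int(math.log(v) ** 2))
    return raw


def transform_categorical(field, value, counts):
    """Replace a value seen fewer than MIN_COUNT times by RARE_VALUE.

    counts maps (field, value) to the number of occurrences in the
    training file; a Counter returns 0 for unseen pairs.
    """
    if counts[(field, value)] < MIN_COUNT:
        return RARE_VALUE
    return value


counts = Counter({("C1", "68fd1e64"): 12, ("C1", "05db9164"): 3})
print(transform_numeric("100"))
print(transform_categorical("C1", "05db9164", counts))
```

The log-squared transform compresses the long tail of the count-like numeric features into a small set of integers, so each can later be treated as a discrete feature.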

2. generate samples
code in main_script---------
python feature_engineer/stat_field_info.py train_trans.csv st_info_file > /dev/null
./feature_engineer/parallel_gs.py -s 15 -m 1 ./feature_engineer/get_samples.py train_trans.csv train_samples_m1
./feature_engineer/parallel_gs.py -s 15 -m 2 ./feature_engineer/get_samples.py train_trans.csv train_samples_m2
----------------------------
with -m 1, the hash trick is used and features are mapped to [1, 1000000]
with -m 2, no hash trick is used and features are mapped to [1, 1086921]
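
The difference between the two modes can be sketched like this. It is only an illustration of the general idea: the hash function (CRC32 here), the key format and the function names are assumptions, and get_samples.py may do this differently.

```python
import zlib

HASH_SPACE = 1_000_000  # matches the [1, 1000000] range given for -m 1


def hashed_index(field, value, space=HASH_SPACE):
    """Hash trick: map a (field, value) pair to an index in [1, space].

    Different pairs can collide on the same index.
    """
    key = f"{field}={value}".encode("utf-8")
    return zlib.crc32(key) % space + 1


def dictionary_index(field, value, table):
    """No hash trick: give each distinct pair the next unused index, starting at 1.

    The largest index equals the number of distinct pairs seen so far.
    """
    key = (field, value)
    if key not in table:
        table[key] = len(table) + 1
    return table[key]


table = {}
print(hashed_index("C1", "68fd1e64"))
print(dictionary_index("C1", "68fd1e64", table))
```

Hashing needs no lookup table and bounds the index range in advance, at the cost of collisions. The dictionary approach is collision-free but has to see all distinct features before the maximum index is known.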

3. train ftrl model with progressive validation
main-class: criteo.FTRLLocalTrain.java
usage: java -jar ftrl_train.jar <input file> <L1> <L2> <alpha> <data_max_index>
code in main_script---------
java -jar ftrl_train.jar train_samples_m1 1.0 1.0 0.1 1000000 > m1_result
java -jar ftrl_train.jar train_samples_m2 1.0 1.0 0.1 1086921 > m2_result
----------------------------
the model is trained with ftrl and evaluated by progressive validation
progressive validation: get a sample -> predict -> compute loss -> update model -> get a new sample -> ...
after every 50k samples, the program prints the elapsed time and the average logloss over the latest 250k samples
global average logloss: about 0.457
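
The training loop described above can be sketched in Python as follows. This is a compact version of the per-coordinate FTRL-Proximal update for logistic regression, not a port of criteo.FTRLLocalTrain. It assumes every active feature has value 1 and uses a beta parameter fixed at 1.0, which does not appear in the jar's usage line. The class and function names are made up for the example.

```python
import math
from collections import defaultdict


class FTRLProximal:
    def __init__(self, l1, l2, alpha, beta=1.0):
        self.l1, self.l2, self.alpha, self.beta = l1, l2, alpha, beta
        self.z = defaultdict(float)  # per-coordinate adjusted gradient sums
        self.n = defaultdict(float)  # per-coordinate sums of squared gradients

    def weight(self, i):
        """Closed-form weight; exactly 0 while |z_i| <= L1 (sparsity)."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        )

    def predict(self, indices):
        s = sum(self.weight(i) for i in indices)
        s = max(min(s, 35.0), -35.0)  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, indices, p, y):
        g = p - y  # logloss gradient for a feature with value 1
        for i in indices:
            w = self.weight(i)
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w
            self.n[i] += g * g


def logloss(p, y):
    p = min(max(p, 1e-15), 1.0 - 1e-15)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def progressive_validation(samples, model):
    """Predict on each sample before training on it; return the average logloss."""
    total, count = 0.0, 0
    for indices, y in samples:
        p = model.predict(indices)
        total += logloss(p, y)
        model.update(indices, p, y)
        count += 1
    return total / count


# toy data: feature 1 always means a click, feature 2 never does
samples = [([1], 1), ([2], 0)] * 100
model = FTRLProximal(l1=1.0, l2=1.0, alpha=0.1)
print(progressive_validation(samples, model))
```

Because each sample is scored before the model sees its label, the running loss is an out-of-sample estimate and no separate validation split is needed.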

4. plot the average logloss
python plot_progress_validation.py
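
The windowed average reported during training (every 50k samples, over the latest 250k samples) could be computed as sketched below. The function name and the generator interface are assumptions for this example. The output format of ftrl_train.jar and the input expected by plot_progress_validation.py are not described here, so this only shows the arithmetic.

```python
from collections import deque


def windowed_average_logloss(losses, report_every=50_000, window=250_000):
    """Yield (samples_seen, average loss over the latest `window` samples)
    each time `report_every` more samples have been processed.

    Until a full window has been seen, the average covers all samples so far.
    """
    blocks = deque(maxlen=window // report_every)  # loss sums of the latest blocks
    block_sum, block_count, seen = 0.0, 0, 0
    for loss in losses:
        block_sum += loss
        block_count += 1
        seen += 1
        if block_count == report_every:
            blocks.append(block_sum)  # the oldest block drops out automatically
            block_sum, block_count = 0.0, 0
            yield seen, sum(blocks) / (len(blocks) * report_every)


# small numbers for illustration: report every 2 samples over the latest 4
for seen, avg in windowed_average_logloss([1, 1, 3, 3, 5, 5], report_every=2, window=4):
    print(seen, avg)
```

Keeping one running sum per 50k block means only five numbers are stored for the 250k window, instead of the individual losses.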


Use feature combination
=======================
code in main_script---------
./feature_engineer/parallel_gs.py -s 15 -m 3 ./feature_engineer/get_samples.py train_trans.csv train_samples_m3
java -jar ftrl_train.jar train_samples_m3 1.0 1.0 0.1 6086921 > m3_result
----------------------------
This needs a lot of disk space, take care!
Because my feature combination method is very simple, the feature space becomes very large and the model overfits. The hash trick also causes many collisions, so the performance is poor. Here are some results from my experiments:
L2 varied, with alpha = 0.1 and L1 = 1.0:
L2       logloss
0.5      0.49062
1.5      0.48979
3.0      0.48862
10.0     0.48419

L1 varied, with alpha = 0.1 and L2 = 1.0:
L1       logloss
0.5      0.49628
1.5      0.48534
3.0      0.47539
10.0     0.45985
15.0     0.45689
20.0     0.45558
30.0     0.45473
50.0     0.45498
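
One simple form of feature combination is to cross every pair of fields and hash each cross into a fixed number of buckets placed after the single-feature indices. The sketch below illustrates that idea only, since the -m 3 logic of get_samples.py is not described here. The bucket count of 5,000,000 is a guess taken from the difference between 6086921 and 1086921, and the function name and key format are hypothetical.

```python
import zlib
from itertools import combinations

BASE_FEATURES = 1_086_921  # index range of the single features (from -m 2)
CROSS_BUCKETS = 5_000_000  # assumption: 6086921 - 1086921


def cross_indices(fields):
    """fields: list of (field_name, value) pairs for one sample.

    Returns one hashed index in [BASE_FEATURES + 1, BASE_FEATURES + CROSS_BUCKETS]
    for every unordered pair of fields.
    """
    out = []
    for (fa, va), (fb, vb) in combinations(fields, 2):
        key = f"{fa}={va}&{fb}={vb}".encode("utf-8")
        out.append(BASE_FEATURES + 1 + zlib.crc32(key) % CROSS_BUCKETS)
    return out


sample = [("I1", "1"), ("C1", "68fd1e64"), ("C2", "80e26c9b")]
print(cross_indices(sample))
```

If all 39 fields (I1-I13 and C1-C26) were crossed this way, each sample would gain 741 extra features. That would be consistent with the disk usage, the collisions and the overfitting noted above, and with larger L1 values giving better results.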
