TalkingData AdTracking Fraud Detection Challenge

models and scores

Model definitions can be found in scripts/model_lib.py.

model1: LGBM with 83 features (76 numerical, 7 categorical).
model2: Keras with 27 features (18 numerical, 9 categorical). The network structure is shown in model.png.

model	private score	public score
model1	0.9836325	0.9828896
model2	0.9830595	0.9822785

Most of these features have already been discussed on the Kaggle forum.

counting features
- mk_feat_count.py
- mk_feat_count_time.py
- mk_feat_countRatio.py
cumulative count
- mk_feat_cumcount.py
- mk_feat_recumcount.py
- mk_feat_cumratio.py
time to next click
- mk_feat_nextClick_leak_day.py
- mk_feat_nextClick_filter.py
time bucket count (build multiple time intervals and count the number of buckets in which the IP appears)
- mk_feat_rangecount.py
- mk_feat_rangecount_minute.py
variance
- mk_feat_var.py
common IP
- mk_feat_common_ip.py
unique count
- mk_feat_uniq_count2.py
target encoding: WOE (weight of evidence)
- mk_feat_woe_all_prev.py
- mk_feat_woe_bound.py
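
The feature families listed above can be illustrated with a minimal pandas sketch. This is not the code from the mk_feat_*.py scripts: the grouping key (ip), the toy data, the output column names, the reading of "recumcount" as a reverse cumulative count, and the smoothing constant in the WOE formula are all assumptions made for illustration. Only the input column names (ip, app, click_time, is_attributed) come from the competition's train.csv.

```python
import numpy as np
import pandas as pd

# Toy click log, already in global time order (assumed for the shift below).
df = pd.DataFrame({
    "ip": [1, 1, 2, 2, 1],
    "app": [3, 3, 3, 3, 4],
    "click_time": pd.to_datetime([
        "2017-11-07 00:00:00",
        "2017-11-07 00:00:05",
        "2017-11-07 00:00:10",
        "2017-11-07 00:00:40",
        "2017-11-07 00:01:00",
    ]),
    "is_attributed": [0, 1, 0, 0, 0],
})

feats = pd.DataFrame(index=df.index)
by_ip = df.groupby("ip")

# Counting feature: total number of clicks per ip.
feats["ip_count"] = by_ip["app"].transform("count")

# Cumulative count: how many earlier clicks this ip already has.
feats["ip_cumcount"] = by_ip.cumcount()

# Reverse cumulative count: how many later clicks this ip still has.
feats["ip_recumcount"] = by_ip.cumcount(ascending=False)

# Time to next click: seconds until the same ip clicks again (NaN if never).
next_time = by_ip["click_time"].shift(-1)
feats["ip_next_click"] = (next_time - df["click_time"]).dt.total_seconds()

# Unique count: number of distinct apps seen for each ip.
feats["ip_uniq_app"] = by_ip["app"].transform("nunique")

# Smoothed weight of evidence per app: log odds of attribution within the
# app minus the global log odds. The constant 0.5 is an arbitrary choice.
a = 0.5
pos = df.groupby("app")["is_attributed"].transform("sum")
neg = df.groupby("app")["is_attributed"].transform("count") - pos
pos_all = df["is_attributed"].sum()
neg_all = len(df) - pos_all
feats["app_woe"] = np.log((pos + a) / (neg + a)) - np.log((pos_all + a) / (neg_all + a))

print(feats)
```

Note that the WOE column here is computed from the labels of the same rows it encodes, which leaks the target. In practice the statistics would be taken from data that precedes the rows being encoded.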

Features will be calculated once and saved to disk.
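
The compute-once-then-reuse pattern described above might look like the following sketch. The helper name load_or_build, the pickle format, and the file name feat_count.pkl are hypothetical. The actual scripts may store features in a different format and layout.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(path, build_fn):
    """Return the cached value at `path`, building and saving it if missing."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    value = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(value, f)
    return value


calls = []


def build_counts():
    # Stand-in for an expensive feature computation.
    calls.append(1)
    return {"ip_count": [3, 3, 2, 2, 3]}


cache_dir = Path(tempfile.mkdtemp())
first = load_or_build(cache_dir / "feat_count.pkl", build_counts)   # computes and saves
second = load_or_build(cache_dir / "feat_count.pkl", build_counts)  # loads from disk
print(len(calls))
```

With a cache like this, a slow feature script only pays its cost on the first run, and later model-building runs read the saved result.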

Feature importance from LGBM can be found in importance.txt.

I used the following environment:

Hardware:

Python3 packages:

First, put sample_submission.csv, test.csv, test_supplement.csv, and train.csv in the input directory.

Then run the shell scripts as follows:

$ cd scripts/

$ ./run_mk_feats.sh

$ ./run_mk_model1.sh

$ ./run_mk_model2.sh

Output prediction files will be written to the csv directory.

Feature extraction (run_mk_feats.sh) took about one day.

Building model1 (run_mk_model1.sh) needs a large amount of memory (~256GB), sorry.

A GPU is required to build model2 (run_mk_model2.sh).

Apache License 2.0
