The goal of this competition is to predict which places people will check into. Facebook released about 40M check-in records from a small city, and the task is to predict the most likely check-in places for roughly 8M test samples.
Because the dataset is so large, this competition is about more than the predictive power of a machine learning algorithm; it also challenges how you handle data at this scale. I developed a number of tricks for dividing the problem into smaller pieces without losing much predictive accuracy, and for speeding up and parallelizing every part of the code (a sketch of the idea follows below).
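To make the divide-and-parallelize idea concrete, here is a minimal sketch (not the repository's actual code) that partitions the map into grid cells and fits one small classifier per cell across worker processes; the cell size and the KNN model are assumptions made for illustration.

```python
# Minimal sketch of the divide-and-parallelize idea, not the repo's actual
# code: split the map into grid cells and fit one classifier per cell in
# parallel. Cell size and the KNN model are assumptions for illustration.
from multiprocessing import Pool

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

CELL_SIZE = 0.5  # assumed cell side length, in the same units as x/y

def assign_cells(df):
    """Map each check-in to a grid-cell id from its (x, y) coordinates."""
    cx = (df['x'] // CELL_SIZE).astype(int)
    cy = (df['y'] // CELL_SIZE).astype(int)
    return cx * 1000 + cy  # unique id per cell

def train_cell(cell):
    """Fit a small model on a single cell's check-ins."""
    cell_id, cell_df = cell
    clf = KNeighborsClassifier(n_neighbors=25)
    clf.fit(cell_df[['x', 'y']], cell_df['place_id'])
    return cell_id, clf

if __name__ == '__main__':
    train = pd.read_csv('train.csv')   # competition training file
    train['cell'] = assign_cells(train)
    with Pool() as pool:               # one worker per CPU core
        models = dict(pool.map(train_cell, train.groupby('cell')))
```

Restricting each model to one cell keeps every training problem small enough to fit in memory, and the cells are independent, so they parallelize trivially.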
- parser.py: parses the raw data, splits it into training/validation/testing sets, and does most of the data pre-processing.
- trainer.py: trains models according to the selected algorithm and parameters.
- evaluator.py: does data post-processing if enabled, evaluates the trained models, and generates the submission file (MAP@3 sketch after this list).
- submiter.py: programmatically submits results to the Kaggle website (sketch after this list).
- main.py: wrapper around the modules above; all hyper-parameters and experiments are handled here.
- blending.py: blends the best models and generates the blended model results (sketch after this list).
- grouper.py: uses t-SNE and KNN results as extra inputs to the training models (sketch after this list).
- conventions.py: helper functions for handling the time format and dataframes (time-feature sketch after this list).
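The competition is scored with MAP@3, so evaluator.py needs something along these lines when scoring the validation split; this is a sketch, not the module's actual code. With a single true place_id per check-in, average precision per row reduces to the reciprocal rank of the correct place.

```python
# Sketch of the MAP@3 metric; with one true place_id per check-in, average
# precision per row is the reciprocal rank of the correct place within the
# (up to) three predictions.
def map_at_3(actual, predicted):
    """actual: true place_ids; predicted: ranked lists of up to 3 place_ids."""
    total = 0.0
    for truth, preds in zip(actual, predicted):
        for rank, place in enumerate(preds[:3]):
            if place == truth:
                total += 1.0 / (rank + 1)
                break
    return total / len(actual)

# Example: the first row is hit at rank 1, the second at rank 2.
assert map_at_3([10, 20], [[10, 1, 2], [3, 20, 4]]) == (1.0 + 0.5) / 2
```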
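submiter.py's own mechanism is not shown in this README; for reference, the same thing can be done with the official `kaggle` package as sketched below, where the file name, message, and competition slug are assumptions.

```python
# Sketch of a programmatic submission using the official `kaggle` package,
# as a stand-in for whatever submiter.py does internally. File name,
# message, and competition slug are assumptions.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads the API token from ~/.kaggle/kaggle.json
api.competition_submit(
    file_name='submission.csv',
    message='blended model',
    competition='facebook-v-predicting-check-ins',
)
```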
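One common way to blend ranked predictions for a MAP@3 target is weighted reciprocal-rank voting. The sketch below assumes each model's output is a mapping from row id to an ordered list of place_ids, which may differ from blending.py's actual format.

```python
# Sketch of reciprocal-rank blending: every model votes for the place_ids
# it ranked, weighted by model weight and rank, and the top three places
# per row are kept. The input format and weights are assumptions.
from collections import defaultdict

def blend(model_preds, weights=None):
    """model_preds: list of dicts mapping row_id -> ranked [place_id, ...]."""
    weights = weights or [1.0] * len(model_preds)
    blended = {}
    for row_id in model_preds[0]:
        scores = defaultdict(float)
        for preds, w in zip(model_preds, weights):
            for rank, place in enumerate(preds[row_id]):
                scores[place] += w / (rank + 1)
        blended[row_id] = sorted(scores, key=scores.get, reverse=True)[:3]
    return blended
```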
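The grouper.py description suggests stacking-style extra features. The sketch below shows one way that could look, appending a t-SNE projection and a KNN confidence column to the feature frame; the column names and hyper-parameters are assumptions rather than the module's real interface.

```python
# Sketch of using t-SNE and KNN outputs as extra model inputs, in the
# spirit of grouper.py; the exact columns and parameters are assumptions.
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier

def add_group_features(df, feature_cols, label_col='place_id'):
    X = df[feature_cols].values
    y = df[label_col].values
    out = df.copy()

    # two t-SNE components of the raw features as extra coordinates
    emb = TSNE(n_components=2, init='random').fit_transform(X)
    out['tsne_0'], out['tsne_1'] = emb[:, 0], emb[:, 1]

    # KNN's highest class probability as a confidence-style feature
    # (a real pipeline would compute this out-of-fold to avoid leakage)
    knn = KNeighborsClassifier(n_neighbors=25).fit(X, y)
    out['knn_top_prob'] = knn.predict_proba(X).max(axis=1)
    return out
```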
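Since conventions.py deals with the time format, here is a sketch of the usual minute-based decomposition for this dataset, where the raw `time` column is commonly treated as minutes; the feature names and the 30-day month approximation are assumptions.

```python
# Sketch of time-format helpers in the spirit of conventions.py: treating
# the raw `time` column as minutes, calendar-like features come from simple
# modular arithmetic. Feature names are assumptions.
def add_time_features(df, time_col='time'):
    """Derive hour / weekday / day-style features from a minutes column."""
    minutes = df[time_col]
    out = df.copy()
    out['hour'] = (minutes // 60) % 24
    out['weekday'] = (minutes // (60 * 24)) % 7
    out['day'] = (minutes // (60 * 24)) % 365         # day of year
    out['month'] = (minutes // (60 * 24 * 30)) % 12   # ~30-day months
    return out
```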
- Top-level entry script: go_train; everything starts here.
- main.py is the wrapper; it hosts all the experiment configurations and kicks them off (a hypothetical sketch of the wiring appears below).
- All the modules live in the ./lib folder.
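To make the experiment handling concrete, below is a hypothetical sketch of how a wrapper like main.py might map experiment configurations onto estimators; every name and parameter key in it is invented for illustration and does not reflect the repository's real interfaces.

```python
# Hypothetical sketch of experiment handling in the spirit of main.py:
# a dictionary of hyper-parameters per experiment plus a small dispatcher
# that turns each configuration into an estimator. All names are invented.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

EXPERIMENTS = {
    'knn_baseline': {'algorithm': 'knn', 'n_neighbors': 25},
    'rf_deeper':    {'algorithm': 'rf', 'n_estimators': 300},
}

def build_model(params):
    """Translate an experiment's parameter dict into an estimator."""
    algo = params['algorithm']
    if algo == 'knn':
        return KNeighborsClassifier(n_neighbors=params['n_neighbors'])
    if algo == 'rf':
        return RandomForestClassifier(n_estimators=params['n_estimators'])
    raise ValueError(f'unknown algorithm: {algo}')

if __name__ == '__main__':
    for name, params in EXPERIMENTS.items():
        print(name, '->', build_model(params))
```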