nachocarracedo / Santader_kaggle_stacking

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Santader_kaggle_stacking

Santander classification Kaggle competiton: https://www.kaggle.com/c/santander-customer-satisfaction

I used this competition to create model ensembles which usually improve the loss score. Here is a nice explanation: http://mlwave.com/kaggle-ensembling-guide/

I practice with stacking, here is the procedure I followed:

2-fold stacking:

  • Split the train set in 2 parts: train_a and train_b
  • Fit a first-stage model on train_a and create predictions for train_b
  • Fit the same model on train_b and create predictions for train_a
  • Finally fit the model on the entire train set and create predictions for the test set.
  • Now train a second-stage stacker model on the probabilities from the first-stage model(s).

More detailed info:

  • First, I do some basic feature engineering and thn I do feature selection to remove noise (using best features from Gradient Boosting).
  • For first-stage models I used RF Gini, RF Entropy, 2 x Gradient Boosting, and AdaBoost.
  • For second-stage model I tried 3: 1) Logistic regression 2) RF 3) Weights on 1st stage models.

Conclusion - next steps.

  • I got a little improvement with stacking but not a big one. Throwing more models into the mix will probably help.
  • Use skitlearn to find weights for 2nd stage. I did it manually and it is very expensive.
  • Save ensambles for the last step of you analysis, after getting you best score and having tried different features. Evaluate if the gain is worth the effort.

TODO:

  • Add comments !!!! I'm sorry for not doing this before hand; I know it makes the project unreadable. Lack of time is responsible but I’ll introduce them as soon as I can.

About


Languages

Language:Python 100.0%