Santader_kaggle_stacking

I used this competition to practice building model ensembles, which usually improve the score. Here is a nice explanation: http://mlwave.com/kaggle-ensembling-guide/

I practiced stacking; here is the procedure I followed:

2-fold stacking:

Split the train set into 2 parts: train_a and train_b
Fit a first-stage model on train_a and create predictions for train_b
Fit the same model on train_b and create predictions for train_a
Finally fit the model on the entire train set and create predictions for the test set.
Now train a second-stage stacker model on the probabilities from the first-stage model(s).
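
The steps above can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the code from this repository: the synthetic data, the helper name `two_fold_stack`, and all hyperparameters are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def two_fold_stack(model_factories, X_train, y_train, X_test, seed=0):
    """Return (train_meta, test_meta) with one probability column per model."""
    idx = np.arange(len(X_train))
    # Split the train set into two halves: train_a and train_b.
    idx_a, idx_b = train_test_split(
        idx, test_size=0.5, random_state=seed, stratify=y_train
    )
    train_meta = np.zeros((len(X_train), len(model_factories)))
    test_meta = np.zeros((len(X_test), len(model_factories)))
    for j, make_model in enumerate(model_factories):
        # Fit on one half and predict the other half, then swap.
        for fit_idx, pred_idx in ((idx_a, idx_b), (idx_b, idx_a)):
            model = make_model().fit(X_train[fit_idx], y_train[fit_idx])
            train_meta[pred_idx, j] = model.predict_proba(X_train[pred_idx])[:, 1]
        # Refit on the entire train set to predict the test set.
        model = make_model().fit(X_train, y_train)
        test_meta[:, j] = model.predict_proba(X_test)[:, 1]
    return train_meta, test_meta


# Synthetic stand-in data (hypothetical, not the competition data).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two first-stage models; the settings are illustrative only.
factories = [
    lambda: RandomForestClassifier(n_estimators=50, criterion="gini", random_state=0),
    lambda: RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0),
]
train_meta, test_meta = two_fold_stack(factories, X_train, y_train, X_test)

# Second-stage stacker trained on the first-stage probabilities.
stacker = LogisticRegression().fit(train_meta, y_train)
test_pred = stacker.predict_proba(test_meta)[:, 1]
```

Every row of `train_meta` comes from a model that never saw that row during fitting, which is what keeps the stacker from simply learning the first-stage models' overfitting.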

More detailed info:

First, I do some basic feature engineering and then feature selection to remove noise (keeping the best features according to Gradient Boosting).
For the first-stage models I used RF Gini, RF Entropy, 2 x Gradient Boosting, and AdaBoost.
For the second-stage model I tried three options: 1) logistic regression, 2) RF, 3) a weighted average of the first-stage models.
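
As a rough sketch of this setup, the snippet below selects features by Gradient Boosting importance and defines the five first-stage models. The synthetic data, the `"median"` threshold, the dictionary keys, and every hyperparameter are assumptions for illustration; the actual settings used in this project may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (hypothetical, not the competition data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Keep the features whose Gradient Boosting importance is at or above the median.
selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    threshold="median",
)
X_selected = selector.fit_transform(X, y)

# First-stage models; the two Gradient Boosting variants differ only in
# assumed hyperparameters.
first_stage = {
    "rf_gini": RandomForestClassifier(n_estimators=50, criterion="gini", random_state=0),
    "rf_entropy": RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0),
    "gb_shallow": GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0),
    "gb_deep": GradientBoostingClassifier(n_estimators=50, max_depth=4, random_state=1),
    "ada": AdaBoostClassifier(n_estimators=50, random_state=0),
}

for name, model in first_stage.items():
    model.fit(X_selected, y)
```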

Conclusion - next steps.

I got a small improvement with stacking, but not a big one. Adding more models to the mix will probably help.
Use scikit-learn to find the weights for the 2nd stage. I did it manually, and that is very expensive.
Save ensembles for the last step of your analysis, after you have tried different features and reached your best score. Evaluate whether the gain is worth the effort.
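
To show why a manual weight search gets expensive, here is a small pure-Python sketch of a grid search over blend weights. The function names, the toy predictions, and the use of log loss as the metric are assumptions for illustration, not the procedure used in this project.

```python
import itertools
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean binary log loss, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def blend(columns, weights):
    """Weighted average of the per-model probability columns."""
    n_rows = len(columns[0])
    return [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(n_rows)]


def grid_search_weights(columns, y_true, steps=10):
    """Try every weight vector on a grid that sums to 1; keep the lowest loss."""
    best_w, best_loss = None, float("inf")
    n_models = len(columns)
    for combo in itertools.product(range(steps + 1), repeat=n_models - 1):
        if sum(combo) > steps:
            continue
        # The last weight is whatever remains so the weights sum to 1.
        weights = [c / steps for c in combo] + [(steps - sum(combo)) / steps]
        loss = log_loss(y_true, blend(columns, weights))
        if loss < best_loss:
            best_w, best_loss = weights, loss
    return best_w, best_loss


# Toy first-stage predictions (hypothetical values).
y_true = [1, 0, 1, 0, 1, 0]
model_a = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1]
model_b = [0.6, 0.5, 0.8, 0.3, 0.4, 0.3]
model_c = [0.7, 0.1, 0.5, 0.6, 0.8, 0.2]

best_w, best_loss = grid_search_weights([model_a, model_b, model_c], y_true)
```

The number of candidate weight vectors grows combinatorially with the number of models and the grid resolution, which is why an optimizer or a fitted second-stage model is usually a better choice than a search like this.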

TODO:

Add comments! I'm sorry for not doing this beforehand; I know it makes the project unreadable. I ran out of time, but I'll add them as soon as I can.

nachocarracedo / Santader_kaggle_stacking