Titanic Competition for Kaggle.
Current leaderboard score: 0.82775
- Extract 'Title' and 'LastName' from 'Name'.
- Combine 'SibSp' and 'Parch' into 'FamMem' (number of family members aboard), and add a new binary feature 'Alone'.
- Keep only whether a passenger has a cabin recorded or not; discard the detailed cabin number.
- Family members share the same 'LastName' and 'Fare'; add a new feature 'FamSurvived' indicating whether any family member survived.
- Drop 'Name', 'LastName', 'SibSp', 'Parch', 'Ticket'.
- Convert categorical features to numerical representations.
- Fill missing values in 'Age' according to 'Pclass' and 'Sex' (both may be correlated with age), and represent 'Age' by age band.
- Fill missing values in 'Fare' with the mean (needed only for the test set), and represent 'Fare' by fare band.
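The extraction steps above (Title/LastName, FamMem, Alone, cabin flag, FamSurvived) can be sketched roughly as follows. This is a minimal illustration on Kaggle-style Titanic columns, not the repo's actual code; the function name `engineer_features` and the `HasCabin` column name are my own, and computing 'FamSurvived' on the test set would in practice need survival labels pulled from the training set.

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the feature steps above on Kaggle-style Titanic columns."""
    out = df.copy()
    # 'Name' looks like 'Braund, Mr. Owen Harris' -> LastName before the
    # comma, Title between the comma and the period.
    out["LastName"] = out["Name"].str.split(",").str[0].str.strip()
    out["Title"] = out["Name"].str.extract(r",\s*([^\.]+)\.")[0].str.strip()
    # Family size and a binary Alone flag.
    out["FamMem"] = out["SibSp"] + out["Parch"]
    out["Alone"] = (out["FamMem"] == 0).astype(int)
    # Keep only whether a cabin is recorded at all.
    out["HasCabin"] = out["Cabin"].notna().astype(int)
    # FamSurvived: did anyone ELSE in the same (LastName, Fare) group
    # survive?  Subtract the passenger's own label from the group sum.
    grp = out.groupby(["LastName", "Fare"])["Survived"]
    out["FamSurvived"] = ((grp.transform("sum") - out["Survived"]) > 0).astype(int)
    return out.drop(columns=["Name", "LastName", "SibSp", "Parch", "Ticket"])
```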
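The imputation and banding steps could look like the sketch below: 'Age' is filled with the median of the passenger's (Pclass, Sex) group, 'Fare' with the overall mean, and both are then replaced by ordinal band indices. The function name, band count, and the equal-width/quantile band choices are assumptions for illustration, not necessarily what this repo uses.

```python
import pandas as pd

def fill_and_band(df: pd.DataFrame, n_bands: int = 4) -> pd.DataFrame:
    """Sketch: impute Age per (Pclass, Sex), Fare by mean, then band both."""
    out = df.copy()
    # Age correlates with class and sex, so fill each missing age with the
    # median of the passenger's (Pclass, Sex) group.
    out["Age"] = out.groupby(["Pclass", "Sex"])["Age"].transform(
        lambda s: s.fillna(s.median())
    )
    # The lone missing Fare (test set) gets the overall mean.
    out["Fare"] = out["Fare"].fillna(out["Fare"].mean())
    # Replace raw values with ordinal band indices: equal-width bands for
    # Age, quantile bands for the heavily skewed Fare.
    out["AgeBand"] = pd.cut(out["Age"], bins=n_bands, labels=False)
    out["FareBand"] = pd.qcut(out["Fare"], q=n_bands, labels=False,
                              duplicates="drop")
    return out.drop(columns=["Age", "Fare"])
```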
Use a random forest classifier with grid-searched hyperparameters.
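A grid search over random forest hyperparameters might be set up as below. The grid values and synthetic data here are placeholders for illustration; the repo's actual tuned parameters are not shown in this note.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the engineered Titanic feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative grid only -- not the repo's tuned values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 6, None],
    "min_samples_leaf": [1, 3],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```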
- Try well-designed model ensembling and stacking. The current fine-tuned random forest classifier still beats a simple ensemble of many models.
- Check whether the current features are redundant, or whether new features should be engineered (e.g., group feature extraction; see reference 4).
- Plot the learning curve for model diagnostics.
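For the stacking item above, scikit-learn's `StackingClassifier` is one way to sketch it: base models feed out-of-fold predictions to a meta-learner. The base estimators and synthetic data here are my own illustrative choices, not a design this repo has settled on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the engineered Titanic feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Base models produce out-of-fold predictions (cv=5) that a logistic
# regression meta-learner combines.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
score = cross_val_score(stack, X, y, cv=5).mean()
print(round(score, 3))
```

Whether this beats the tuned single random forest is exactly the open question in the TODO above.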
- https://www.kaggle.com/startupsci/titanic-data-science-solutions
- https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
- https://www.kaggle.com/konstantinmasich/titanic-0-82-0-83
- https://www.kaggle.com/shunjiangxu/blood-is-thicker-than-water-friendship-forever
- https://blog.csdn.net/guoxinian/article/details/73740746