We will focus on predicting the survival of Titanic passengers. We have access to training data in the file data.csv and evaluation data in the file evaluation.csv.

survived: Whether the passenger survived (0 = No, 1 = Yes); target variable to predict
pclass: Class of the ticket (1 = first, 2 = second, 3 = third)
name: Name
sex: Gender
age: Age in years
sibsp: Number of siblings/spouses on board
parch: Number of parents/children on board
ticket: Ticket number
fare: Ticket price
cabin: Cabin number
embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
home.dest: Residence/Destination

- Load the data from the data.csv file into a notebook. Divide it into subsets for training, validation, and testing.
- Conduct basic data preprocessing:
  - Transform features for use in the chosen classification model.
  - Optionally create new features based on existing ones.
  - Handle missing values appropriately, avoiding methodological errors.
  - Utilize visualizations with concise and proper commentary.
- Apply a decision tree and k-nearest neighbors to the prepared data. For each model:
  - Comment on the suitability of the model for the given task.
  - Select key hyperparameters for tuning and find their optimal values.
  - Calculate the F1 score, plot the ROC curve, and determine the AUC. Be cautious about methodological errors.
  - Provide thorough commentary on the obtained results.
- Choose the final model from the tested options in the previous step. Estimate the expected accuracy on new data not previously available. Beware of methodological errors.
- Load the evaluation data from the evaluation.csv file. Use the final model to make predictions for these data (the target variable is no longer present). Create a results.csv file with two columns: ID, survived. Submit this file alongside the notebook.
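
The splitting, preprocessing, tuning, and evaluation steps listed above can be approached in many ways. The following is only a minimal sketch, assuming pandas and scikit-learn, a 60/20/20 split, and an illustrative choice of feature columns. The helper names (`build_pipeline`, `split_60_20_20`, `tune`, `evaluate`) are hypothetical and not part of the assignment. Wrapping imputation and scaling in a pipeline is one way to avoid the methodological error of learning statistics from validation or test rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: an illustrative subset of the columns described in the task.
NUMERIC = ["pclass", "age", "sibsp", "parch", "fare"]
CATEGORICAL = ["sex", "embarked"]


def build_pipeline(model):
    """Bundle preprocessing with the model so that imputation and scaling
    statistics are learned only from the data passed to fit()."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),  # matters for kNN, harmless for trees
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    prep = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    return Pipeline([("prep", prep), ("model", model)])


def split_60_20_20(df, seed=42):
    """Stratified split into (train, validation, test) pairs of (X, y)."""
    X, y = df[NUMERIC + CATEGORICAL], df["survived"]
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


def evaluate(pipe, data):
    """Accuracy, F1, and ROC AUC of a fitted pipeline on one subset."""
    X, y = data
    pred = pipe.predict(X)
    proba = pipe.predict_proba(X)[:, 1]  # estimated probability of class 1
    return {
        "accuracy": accuracy_score(y, pred),
        "f1": f1_score(y, pred),
        "auc": roc_auc_score(y, proba),
    }


def tune(candidates, train, val):
    """Fit every candidate on the training subset and return the pipeline
    with the highest F1 score on the validation subset."""
    best, best_f1 = None, -1.0
    for model in candidates:
        pipe = build_pipeline(model).fit(*train)
        f1 = evaluate(pipe, val)["f1"]
        if f1 > best_f1:
            best, best_f1 = pipe, f1
    return best, best_f1


# Usage sketch, assuming df = pd.read_csv("data.csv"):
# train, val, test = split_60_20_20(df)
# trees = [DecisionTreeClassifier(max_depth=d, random_state=0) for d in range(1, 11)]
# knns = [KNeighborsClassifier(n_neighbors=k) for k in range(1, 31, 2)]
# best, val_f1 = tune(trees + knns, train, val)
# print(evaluate(best, test))  # use the test subset once, after model selection
```

In this sketch the validation subset drives hyperparameter selection, and the test subset is used only for the final estimate, so the reported accuracy is not biased by the tuning itself.
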
Example of the first rows in the results.csv file:
ID,survived
1000,0
1001,1
...
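
A file in the format shown above could be produced as follows. This is a small sketch using only the Python standard library. The helper name `write_results` is hypothetical, and it assumes the passenger IDs are available alongside the evaluation data and that predictions are 0/1 values in the same order.

```python
import csv
import io


def write_results(ids, predictions, out):
    """Write a header and one 'ID,survived' row per passenger to an open
    text file object."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ID", "survived"])
    for passenger_id, pred in zip(ids, predictions):
        writer.writerow([passenger_id, int(pred)])  # force plain 0/1 integers


# Usage sketch (the ID values here are just the ones from the example):
# with open("results.csv", "w", newline="") as f:
#     write_results([1000, 1001], [0, 1], f)
```

With pandas, building a two-column DataFrame and calling `to_csv("results.csv", index=False)` would give an equivalent file.
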