classification decision-tree-classifier knn-classifier matplotlib pandas python scikit-learn titanic-survival-prediction

Data Preprocessing and Binary Classification

Data Source

We will focus on predicting the survival of Titanic passengers. We have access to training data in the file data.csv and evaluation data in the file evaluation.csv.

Feature List

survived: Whether the passenger survived (0 = No, 1 = Yes) - target variable to predict
pclass: Class of the ticket (1 = first, 2 = second, 3 = third)
name: Name
sex: Gender
age: Age in years
sibsp: Number of siblings/spouses on board
parch: Number of parents/children on board
ticket: Ticket number
fare: Ticket price
cabin: Cabin number
embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
home.dest: Residence/Destination

Given Task

Load the data from the data.csv file into a notebook. Divide it into subsets for training, validation, and testing.
Conduct basic data preprocessing:
- Transform features for use in the chosen classification model.
- Optionally create new features based on existing ones.
- Handle missing values appropriately, avoiding methodological errors.
- Utilize visualizations with concise and proper commentary.
Apply a decision tree and k-nearest neighbors to the prepared data. For each model:
- Comment on the suitability of the model for the given task.
- Select key hyperparameters for tuning and find their optimal values.
- Calculate the F1 score, plot the ROC curve, and determine the AUC. Be cautious about methodological errors.
- Provide thorough commentary on the obtained results.
Choose the final model from the tested options in the previous step. Estimate the expected accuracy on new data not previously available. Beware of methodological errors.
Load the evaluation data from the evaluation.csv file. Use the final model to make predictions for these data (the target variable is no longer present). Create a results.csv file with two columns: ID, survived. Submit this file alongside the notebook.

Example of the first rows in the results.csv file:

ID,survived
1000,0
1001,1
...

About

Predicting the survival of Titanic passengers (binary classification) with usage of decision tree and KNN from scikit-learn.

classification decision-tree-classifier knn-classifier matplotlib pandas python scikit-learn titanic-survival-prediction

Languages

Language:Jupyter Notebook 100.0%