thomas-keller / intro_python_rc

Intro to Python materials for Research Computing

Overview

Welcome to the Research Computing Training Program, Module 2. This module will teach you the basics of Python programming with a focus on applying them to the manipulation of data files. As in Module 1, you will be provided with basic training materials and links to resources, which you will use to build a machine learning model with one of Python's popular modules, scikit-learn (sklearn). Understanding the inner workings of this model is left to later, more detailed sections on machine learning and AI. For this task it is sufficient to demonstrate that you can import data, perform basic cleaning, and report the results of a random forest model from scikit-learn using the Titanic dataset.
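
As a point of reference, here is a minimal sketch of that workflow, assuming the Kaggle train.csv is in the working directory and using only a handful of columns from the Kaggle data dictionary; your own cleaning and feature choices may well differ.

```python
# Minimal sketch: load the Titanic training data, do light cleaning,
# fit a random forest classifier, and report a held-out accuracy.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")  # assumes the Kaggle file is in this directory

# Basic cleaning: fill missing ages and encode sex as a number.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```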

Required Tools

For a longer list of tools, please see the more complete list established in Module 1. Notably, we will use Git for version control and an editor such as Visual Studio Code.

  • Python 3 - Python 3 took longer than expected to become established, but at this point most scientific code has moved to Python 3, and if you are learning Python fresh there is no benefit to starting with Python 2.

Recommended

  • Conda - Provides an easy package management system for Python and related programs (unlike pip, it can install things that are not strictly Python). It also provides optimized packages for scientific libraries such as numpy, scikit-learn, and pandas.
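
If you do set up a conda (or pip) environment, a quick way to confirm the scientific stack is available is to import it and print the versions; the exact version numbers will depend on your installation.

```python
# Sanity check that the core scientific libraries are installed
# in the currently active environment.
import numpy
import pandas
import sklearn

print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)
```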

Deliverables

The overall project you will be expected to deliver is to go from the initial input files of the Titanic dataset from Kaggle and, after some basic data cleaning, produce results from a random forest. Machine learning and statistics often begin by understanding the type of data you are trying to predict. Here, the problem as defined on the Kaggle website is: "in this challenge, we ask you to complete the analysis of what sorts of people were likely to survive." This statement hints at a categorical dependent variable. An example of a numerical regression, by contrast, would be predicting, say, the age at which a type of cancer occurs in the general population.
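
A quick way to see this for yourself, assuming the Kaggle train.csv, is to look at the values the dependent variable actually takes: Survived is only ever 0 or 1, which is why a random forest classifier (rather than a regressor) is the right tool here.

```python
# Confirm that the target column is categorical (0/1), not continuous.
import pandas as pd

df = pd.read_csv("train.csv")
print(df["Survived"].value_counts())  # expect counts for just 0 and 1
```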

Resources

Most of your Python learning will come in the form of working through the Jupyter notebooks that make up the Kaggle Learn courses.

  • Python, Kaggle Learn - There is a lot more you can do in Python beyond the basics, but the lessons laid out here are a good starting point for writing solid code. Searching Kaggle for "random forest" also turns up many results.

  • Pandas, Kaggle Learn - Pandas is the main Python library for data import and manipulation. This course is a series of notebooks on how to wrangle data into a form that is amenable to downstream analysis; a short wrangling sketch follows this list.

  • Kaggle Kernels - Looking at how other people have cleaned and analyzed this data is encouraged; however, your code must be your own and you must be able to explain it.

  • Machine learning, Kaggle Learn (bonus) - This example may be informative. Note, however, that it uses a separate dataset and a random forest regressor, rather than the random forest classifier we will use for the Titanic dataset.

  • scikit-learn cheat sheet - A decision tree of possible algorithms to use in scikit-learn depending on your data (there are a lot).
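
As flagged in the Pandas item above, here is a short wrangling sketch, again assuming the Kaggle train.csv. It only inspects and encodes the data, leaving the modeling to the main deliverable, and the specific choices (filling Embarked with the mode, one-hot encoding with get_dummies) are just one reasonable option.

```python
# Inspect missing values and types, then encode categorical columns
# so they can be passed to scikit-learn.
import pandas as pd

df = pd.read_csv("train.csv")

print(df.isna().sum())   # which columns have missing values
print(df.dtypes)         # which columns are numeric vs. object

# Fill missing embarkation ports with the most common value, then
# one-hot encode the categorical columns.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df = pd.get_dummies(df, columns=["Sex", "Embarked"])
print(df.head())
```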
