wentao-uw / MAST30034_Python

The Applied Data Science (Python Stream) repository written by myself for 2021 Semester 2

Welcome to the MAST30034 Python Repo

The R stream is available here.

Dates and Times

On Campus:

  • Monday: 13:15 - 15:15 (R - Yue)
  • Tuesday: 14:15 - 16:15 (Python - Akira)
  • Wednesday: 11:00 - 13:00 (Python - Akira)
  • Thursday: 10:00 - 12:00 (Python - Calvin)

Online:

  • Tuesday: 16:15 - 18:15 (R - Yue)
  • Wednesday: 14:15 - 16:15 (Python - Calvin)
  • Thursday: 13:00 - 15:00 (Python - Akira), 15:15 - 17:15 (Python - Akira)

Tutorials

The first five tutorials cover the content outlined below. For the remainder of the semester, tutorials run as consultations or additional tutorials:

  1. Introduction and Project 1 Overview:

    • Using the JupyterHub server
    • Using GitHub Desktop vs Git CLI (Command Line Interface)
    • Project 1 Overview
    • Python Revision
    • Introduction to folium and bokeh
    • Data Serialization
    • Downloading Files using Python
    • Advanced: WSL2 Installation + PySpark Installation
  2. Geospatial Visualization and Analysis:

    • Map Clusters, GIS Heatmaps, HexBins (vs SquareBins), Choropleths
    • Installing and using GeoPandas
    • Descriptive statistics
    • Histograms and Binning
    • Advanced: PySpark
  3. Regression and Discussion:

    • Linear Regression
    • AIC vs MSE vs R-Squared
    • Stepwise Selection (backward and forward, using AIC)
    • Penalized Regression (LASSO and Ridge)
    • Generalized Linear Model example (Poisson for count data)
    • Advanced: PySpark + Spark SQL
  4. Machine Learning and Working as a Team:

    • Discussion: Overfitting, Curse of Dimensionality, Feature Engineering, etc.
    • Dimensionality Reduction
    • Agile Methodology + Standups
    • Advanced: PySpark + Spark SQL
  5. Project 2 Overview:

    • Introduction of themes
    • Getting into teams
    • Assessment Overview
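
As a small taste of the Tutorial 3 topic "AIC vs MSE vs R-Squared", the sketch below fits a one-feature linear regression in plain Python and compares it against an intercept-only baseline. It is only an illustration, not the tutorial material itself: the toy data is made up, the helper names `fit_simple_ols` and `regression_metrics` are hypothetical, and the AIC uses the Gaussian form n * ln(RSS / n) + 2k, which drops constant terms and counts only the regression coefficients in k (other conventions also count the error variance).

```python
import math


def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def regression_metrics(y, y_hat, n_params):
    """Return MSE, R-squared and (Gaussian, constant-free) AIC for a fit."""
    n = len(y)
    mean_y = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    tss = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares
    return {
        "mse": rss / n,
        "r_squared": 1 - rss / tss,
        "aic": n * math.log(rss / n) + 2 * n_params,
    }


# Made-up toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(x, y)
y_hat = [intercept + slope * xi for xi in x]
metrics = regression_metrics(y, y_hat, n_params=2)

# Intercept-only baseline: always predict the mean of y.
mean_y = sum(y) / len(y)
baseline = regression_metrics(y, [mean_y] * len(y), n_params=1)

print("linear fit:", metrics)
print("baseline:  ", baseline)
```

MSE and R-squared never get worse when a parameter is added to a least-squares model, whereas AIC adds a penalty of 2 per parameter. That penalty is why AIC is the criterion used for the stepwise selection listed in Tutorial 3.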

Project 2 Tutorials (Week 6 - 12)

  • Attendance is mandatory. Groups are excused one absence only.
  • The last 2 weeks of tutorials will be presentations; all groups must attend a designated tutorial.
  • The remainder of tutorials will act as checkpoints, consultation, and a chance for your group to conduct standups at a fixed time slot.

Python Libraries Covered

Statistical Modeling / Machine Learning:

  • sklearn, statsmodels

Data Engineering / End-to-End Pipelines:

  • Pandas, PySpark, NumPy, GeoPandas, papermill, re (regex)

Visualizations:

  • Plotly, Folium, Bokeh, seaborn, matplotlib

Languages

  • Jupyter Notebook: 59.1%
  • HTML: 40.9%