alexyarosh / csci5352-f23

Materials for the course "Datacenter-scale computing" at CU Boulder Fall 2023

CSCI 5253: Datacenter-scale computing

This repository contains the lab files for CSCI 5253, Fall 2023.

Our goal over the semester is to build a scalable data pipeline that processes the Austin Animal Shelter Outcomes dataset.

Lab 1:

  • Create a Dockerized script that reads data from a CSV file, processes it, and writes the result to another CSV file
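
A minimal sketch of what the Lab 1 script could look like, not the course's reference solution. The function names `transform` and `run` and the cleaning rules (snake_case headers, dropping rows with no outcome type) are assumptions chosen for illustration; the column name `Outcome Type` is assumed from the public dataset.

```python
import csv
import io


def transform(rows):
    """Yield cleaned rows: snake_case column names, stripped values.

    Rows without an outcome type are dropped (an illustrative rule,
    not something the lab prescribes).
    """
    for row in rows:
        cleaned = {
            key.strip().lower().replace(" ", "_"): (value or "").strip()
            for key, value in row.items()
            if key is not None  # skip overflow fields from malformed rows
        }
        if cleaned.get("outcome_type"):
            yield cleaned


def run(source, target):
    """Read CSV from file-like `source`, transform it, write CSV to `target`.

    Returns the number of rows written.
    """
    rows = list(transform(csv.DictReader(source)))
    if not rows:
        return 0
    writer = csv.DictWriter(target, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)
```

In a container, a script like this would typically be the image's entry point, with the input and output paths passed as arguments and the data directory mounted as a volume.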

Lab 2:

  • Create a Dockerized PostgreSQL data warehouse (DW) to store the data
  • Use dimensional modeling for the data
  • Load the data into the DW through docker-compose
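
To make the dimensional-modeling step concrete, here is a small star-schema sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server; the table names (`dim_animal`, `dim_outcome_type`, `fact_outcome`), their columns, and the `load` helper are hypothetical and not taken from the lab materials.

```python
import sqlite3

# Two dimension tables and one fact table that references them by surrogate key.
SCHEMA = """
CREATE TABLE dim_animal (
    animal_key  INTEGER PRIMARY KEY,
    animal_id   TEXT UNIQUE NOT NULL,
    animal_type TEXT,
    breed       TEXT
);
CREATE TABLE dim_outcome_type (
    outcome_type_key INTEGER PRIMARY KEY,
    outcome_type     TEXT UNIQUE NOT NULL
);
CREATE TABLE fact_outcome (
    outcome_key      INTEGER PRIMARY KEY,
    animal_key       INTEGER NOT NULL REFERENCES dim_animal (animal_key),
    outcome_type_key INTEGER NOT NULL REFERENCES dim_outcome_type (outcome_type_key),
    outcome_datetime TEXT NOT NULL
);
"""


def load(conn, rows):
    """Insert each source row: dimensions first (deduplicated), then the fact."""
    for row in rows:
        # INSERT OR IGNORE is SQLite syntax; PostgreSQL would use
        # INSERT ... ON CONFLICT DO NOTHING instead.
        conn.execute(
            "INSERT OR IGNORE INTO dim_animal (animal_id, animal_type, breed) "
            "VALUES (?, ?, ?)",
            (row["animal_id"], row["animal_type"], row["breed"]),
        )
        conn.execute(
            "INSERT OR IGNORE INTO dim_outcome_type (outcome_type) VALUES (?)",
            (row["outcome_type"],),
        )
        # Look up both surrogate keys by natural key and write the fact row.
        conn.execute(
            "INSERT INTO fact_outcome (animal_key, outcome_type_key, outcome_datetime) "
            "SELECT a.animal_key, o.outcome_type_key, ? "
            "FROM dim_animal a, dim_outcome_type o "
            "WHERE a.animal_id = ? AND o.outcome_type = ?",
            (row["datetime"], row["animal_id"], row["outcome_type"]),
        )
    conn.commit()
```

An animal with several outcomes appears once in `dim_animal` and once per event in `fact_outcome`, which is the main point of separating facts from dimensions.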

Lab 3:

  • Change the pipeline to write the intermediate data to cloud storage after every step
  • Switch from the PostgreSQL DW to a cloud DW
  • Orchestrate the pipeline with Airflow
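
The staging pattern in Lab 3 can be sketched without any cloud account. In the example below, an in-memory `InMemoryStore` class stands in for a cloud bucket, and each step reads its input from the store and writes its output back under a run-specific key. All names here (`InMemoryStore`, `extract`, `transform`, `load`, `run_pipeline`, the key layout) are hypothetical. In an Airflow setup, each step function would roughly correspond to one task, with the returned key passed to the next task.

```python
import csv
import io


class InMemoryStore:
    """Stand-in for a cloud bucket: maps object keys to text blobs."""

    def __init__(self):
        self._objects = {}

    def put(self, key, text):
        self._objects[key] = text

    def get(self, key):
        return self._objects[key]


def extract(store, raw_csv, run_id):
    """Stage the raw CSV text and return its key."""
    key = f"{run_id}/raw.csv"
    store.put(key, raw_csv)
    return key


def transform(store, raw_key, run_id):
    """Read the staged raw data, drop rows with no outcome type, stage the result."""
    reader = csv.DictReader(io.StringIO(store.get(raw_key)))
    kept = [row for row in reader if row.get("Outcome Type")]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames or [], lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    key = f"{run_id}/clean.csv"
    store.put(key, out.getvalue())
    return key


def load(store, clean_key):
    """Count the staged clean rows; a real load step would copy them into the cloud DW."""
    return sum(1 for _ in csv.DictReader(io.StringIO(store.get(clean_key))))


def run_pipeline(store, raw_csv, run_id):
    """Run the steps in order, passing only storage keys between them."""
    raw_key = extract(store, raw_csv, run_id)
    clean_key = transform(store, raw_key, run_id)
    return load(store, clean_key)
```

Because steps exchange only keys, any step can be rerun on its own from the staged output of the previous one, which is what makes the pipeline straightforward to hand to an orchestrator.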


Languages

Jupyter Notebook 91.9%, Python 7.8%, Dockerfile 0.3%