myrfa / dataengtp

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

share title
true
Readme

Data Engineering

The course aims at giving an overview of Data Engineering foundational concepts. It is tailored for 1st and 2nd year Msc students and PhDs who would like to strengthen their fundamental understanding of Data Engineering, i.e., Data Modelling, Collection, and Wrangling.

Origin and Design

The course was originally held in different courses at Politecnico di Milano (๐Ÿ‡ฎ๐Ÿ‡น) by Emanuele Della Valle and Marco Brambilla. The first issue as a unified journey into the data world dates back to 2020 at the University of Tartu (๐Ÿ‡ช๐Ÿ‡ช) by Riccardo Tommasini, and where it is still held by professor Ahmed Awad (course LTAT.02.007). At the same time, the course was adopted by INSA Lyon (๐Ÿ‡ซ๐Ÿ‡ท) as OT7 (2022) and PLD "Data"

logo tartu logo polimi

logo insa

Contributors

Learning Goals

Students of this course will obtain two sets of skills: one that is deeply technical and necessarily technologically biased, one that is more abstract (soft) yet essential to build the professional figure that fits in a data team.

Soft Skills

  • Build a general overview of the data lifecycle and its problems
    • ETL, ELT
    • Data Pipelines
  • Understand the role of the Data Engineer, as opposed to the more famous figure of Data Scientist.
    • Skills required by data engineers
    • Fitting in a data team
  • Presenting a data project
    • Writing a post mortem

Creativity Bonus: the students will be encouraged to come up with their information needs about a domain of their choice.

Hard Skills

After a general overview of the data lifecycle, the course deeps into an (opinionated) view of the modern [[Data Warehouse]]. To this extent, the will touch basic notions of Data Wrangling, in particular Cleansing and Transformation.

At the core of the course learning outcome there is the ability to design, build, and maintain a data pipelines.

Technological Choice (May 2023): Apache Airlfow

[[Why Apache Airfllow?]]

Regarding the technological stack of the course, modulo the choice of the lecturer, the following systems are encourages.

  • Data Formats
    • JSON, CSV, Avro
    • Parquet
  • Relational Databases
    • Postgres
  • Document Databases
    • MongoDB
  • Columnar Stores
    • DuckDB
  • Key-Value Stores
    • Redis
  • Graph Databases
    • Neo4J

The interaction with the aforementioned systems is via python using Jupyter notebooks. The environment is powered by [[Docker]] and orchestrated using [[Docker Compose]].

Challenge Based Learning

The cours follows a challenge based learning approach. In particular, each system is approached independently with a uniform interface (python API). The students main task is to build a pipelines managed by Apache Airflow that integrates 3/5 of the presented systems. The course schedule does not include an explanation on how such integration should be done. Indeed, it is up for each group to figure out how to do it, if developing a custom operator or scheduling scripts or more. The students are then encouraged to discuss their approach and present the limitations.

Prerequisites

  • Git and Github
  • SQL and Relational Databases
  • Python
  • Java (basics)

Syllabus

๐Ÿ–๏ธ = Practice ๐Ÿ““ = Lecture

  • Into: Who is the Data Engineer + Data Lifecycle (๐Ÿ““)
  • Docker Fundamentals (๐Ÿ–๏ธ)
  • Data Modelling (๐Ÿ““)
    • ER (basics)
    • Dimensional Modelling and Star Schema (basics)
    • Data Formats: JSON, CSV, Avro, Parquet (๐Ÿ–๏ธ)
  • Data Orchestrations and Data Pipelines (๐Ÿ““)
    • Apache Airflow (๐Ÿ–๏ธ)
  • Data Wrangling and Cleansing (๐Ÿ““)
    • Pandas (๐Ÿ–)
  • Data Ingestion and Document Databases (๐Ÿ““)
    • MongoDB (๐Ÿ–)
  • Caching and Key-Value Stores (๐Ÿ““)
    • Redis or RocksDB (๐Ÿ–)
  • Querying: (๐Ÿ““)
    • Postgres or SQLLite (๐Ÿ–)
  • Advanced Querying: Wide Columns Stores and Graph Databases (๐Ÿ““)
    • Cassandra or DuckDB
    • Neo4J or Memgraph (๐Ÿ–)

Evaluation

  • Multiple Choice Questions (30%)
  • Project (60%)
  • Presentation (10%)

Latest Course Issue

Books

Database System Concepts 7th Edition Avi Silberschatz Henry F. Korth S. Sudarshan McGraw-Hill ISBN 9780078022159

About


Languages

Language:Jupyter Notebook 76.2%Language:HTML 23.0%Language:Python 0.8%Language:Dockerfile 0.0%