Leo200467 / de-zoomcamp-2022

Data Engineering Zoomcamp 2022

This repo contains my week-by-week work following the Data Engineering Zoomcamp by DataTalks.Club.

Week 1 - Introduction to Docker / Basics and Setup

Started: 17 January 2022

  • First steps with Docker, creating images and containers ready for work.

  • Introduction to Terraform and Infrastructure as Code.

  • Implementing PostgreSQL in a container and ingesting data using Python/Pandas.
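The ingestion step above can be sketched roughly as follows. This is a minimal illustration, not the code from this repo: it uses the standard-library `sqlite3` and `csv` modules as stand-ins for PostgreSQL and Pandas so that it runs without any services, and the table name, columns, and sample rows are all hypothetical. The idea it shows is the same one used with `pandas.read_csv(..., chunksize=...)`: read the file in fixed-size chunks and insert each chunk separately, so a large file never has to fit in memory at once.

```python
import csv
import io
import sqlite3
from itertools import islice

# Hypothetical sample data standing in for a real trips CSV file.
SAMPLE_CSV = """trip_id,passenger_count,total_amount
1,1,12.5
2,3,30.0
3,2,8.75
4,1,19.2
5,4,42.0
"""


def ingest_csv(conn, csv_file, table="trips", chunk_size=2):
    """Insert CSV rows into `table` in fixed-size chunks; return rows inserted."""
    reader = csv.DictReader(csv_file)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(trip_id INTEGER, passenger_count INTEGER, total_amount REAL)"
    )
    total = 0
    while True:
        # Pull at most `chunk_size` rows from the reader.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        conn.executemany(
            f"INSERT INTO {table} "
            "VALUES (:trip_id, :passenger_count, :total_amount)",
            chunk,
        )
        conn.commit()  # one commit per chunk
        total += len(chunk)
    return total


# sqlite3 in-memory database used here instead of a PostgreSQL container.
conn = sqlite3.connect(":memory:")
inserted = ingest_csv(conn, io.StringIO(SAMPLE_CSV))
row_count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
```

With a real PostgreSQL container, the connection would come from a PostgreSQL driver instead, and the chunk loop would stay essentially the same.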

All homework completed.

More details can be found in this folder

Week 2 - Data Ingestion

Started: 24 January 2022

  • Introduction to Workflow Orchestration and why use it.

  • Setting up an Airflow environment with Docker.

  • First data transfer job in Google Cloud using the "Data Transfer" tool.

  • Transferring local data to PostgreSQL using an Airflow DAG.

  • Transferring data from S3 to Google Cloud using Airflow, converting it from CSV to Parquet along the way.
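To illustrate why an orchestrator is useful, here is a small sketch of the core idea behind a DAG: tasks declare what they depend on, and the runner derives a valid execution order. It does not use Airflow; it relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+), and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "download_csv": set(),
    "convert_to_parquet": {"download_csv"},
    "upload_to_gcs": {"convert_to_parquet"},
    "create_external_table": {"upload_to_gcs"},
}


def run_pipeline(deps, tasks):
    """Run each task only after its upstream tasks; return the execution order."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()
        order.append(name)
    return order


log = []
# Each task just records that it ran; real tasks would do the actual work.
tasks = {name: (lambda n=name: log.append(f"ran {n}")) for name in dependencies}
execution_order = run_pipeline(dependencies, tasks)
```

An orchestrator such as Airflow adds scheduling, retries, and logging on top of this ordering, which is what makes it preferable to a single long script.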

All homework completed.

More details can be found in this folder

Week 3 - Data Warehouse

Started: 31 January 2022

  • Data warehouse (BigQuery)
  • What is a data warehouse solution
  • What is BigQuery, why is it so fast, cost of BQ
  • Partitioning and clustering, automatic re-clustering
  • Pointing to a location in Google Cloud Storage
  • Loading data into BigQuery and PostgreSQL using Airflow operators
  • BQ best practices
  • Misc: BQ Geo location, BQ ML
  • Alternatives (Snowflake/Redshift)

All homework completed.

More details can be found in this folder

Week 4 - Analytics engineering

Started: 7 February 2022

  • Basics of analytics engineering
  • Developing a dbt project (Combination of coding + theory)
  • Visualising the data in Google Data Studio

All homework completed.

More details can be found in this folder

Week 5 - Batch processing

Started: 16 February 2022

  • Batch processing
  • What is Spark
  • Spark Dataframes
  • Spark SQL
  • Internals: GroupBy and joins
  • Resilient Distributed Dataset (RDD)
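The GroupBy internals mentioned above can be pictured as a two-stage aggregation. The sketch below is plain Python, not Spark, and the data and function names are hypothetical. Each input partition is first aggregated locally, the partial results are then shuffled so that all values for one key land on the same reducer, and finally each reducer combines what it received.

```python
import zlib
from collections import defaultdict


def partial_aggregate(partition):
    """Stage 1: sum amounts per key inside a single input partition."""
    sums = defaultdict(int)
    for key, amount in partition:
        sums[key] += amount
    return dict(sums)


def shuffle(partials, num_reducers):
    """Route each (key, partial_sum) to a reducer picked by a stable key hash."""
    buckets = [[] for _ in range(num_reducers)]
    for partial in partials:
        for key, value in partial.items():
            index = zlib.crc32(key.encode("utf-8")) % num_reducers
            buckets[index].append((key, value))
    return buckets


def final_aggregate(buckets):
    """Stage 2: combine the partial sums that arrived at each reducer."""
    result = {}
    for bucket in buckets:
        for key, value in bucket:
            result[key] = result.get(key, 0) + value
    return result


# Hypothetical (zone, amount) records split across two input partitions.
partitions = [
    [("zone_a", 10), ("zone_b", 5), ("zone_a", 2)],
    [("zone_b", 1), ("zone_c", 7), ("zone_a", 3)],
]
partials = [partial_aggregate(p) for p in partitions]
totals = final_aggregate(shuffle(partials, num_reducers=2))
```

Aggregating before the shuffle means only one record per key per partition crosses the network, rather than every raw row.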

All homework completed.

More details can be found in this folder

Week 6 - Stream processing

Started: 16 February 2022

  • Basics
    • What is Kafka
    • Internals of Kafka, broker
    • Partitioning of Kafka topic
    • Replication of Kafka topic
  • Consumer-producer
  • Schemas (Avro)
  • Streaming
  • Kafka Streams
  • Kafka Connect
  • Alternatives (PubSub/Pulsar)
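Topic partitioning by key can be sketched without a running broker. The example below is plain Python with hypothetical keys and values, and it uses `zlib.crc32` as a stand-in hash (an assumption for illustration, not the hash function Kafka clients actually use). It shows the property that matters: every message with the same key goes to the same partition, so per-key ordering is preserved even though the topic as a whole is split.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(messages, num_partitions):
    """Append each (key, value) message to the partition chosen for its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical ride events keyed by rider.
messages = [
    ("rider_1", "pickup"),
    ("rider_2", "pickup"),
    ("rider_1", "dropoff"),
    ("rider_2", "dropoff"),
]
topic = produce(messages, num_partitions=3)
```

A consumer reading one partition therefore always sees a given rider's `pickup` before the matching `dropoff`.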

All homework completed.

More details can be found in this folder

Week 7 - Final Project

Final project is available here

Languages

Python 43.4%, Jupyter Notebook 41.5%, PLpgSQL 8.9%, Dockerfile 2.8%, HCL 2.2%, Shell 0.8%, Makefile 0.4%