Data Engineering ZoomCamp 2024

This repo contains my homework, notes, and final project(s) for the Data Engineering Zoomcamp by DataTalks.Club.

Each week I worked through a series of videos and then completed the homework exercises.

Tools

We used a range of tools:

Terraform: Infrastructure-as-Code (IaC)
Docker: Containerization
SQL: Data Analysis & Exploration
Mage: Workflow Orchestration. Airflow is an alternative.
dbt (data build tool): Open-source command-line tool that enables data analysts and engineers to transform and model data in their data warehouses using SQL.
Metabase: Open-source business intelligence (BI) and analytics tool that allows users to easily visualize and analyze their data. Google Looker Studio is an alternative.
Google Dataproc: Service used to run Apache Hadoop, Apache Spark, Apache Hive, Apache Pig, and other big data processing frameworks. Similar to Amazon EMR or Azure HDInsight.
Google Cloud Storage: Google's object storage, used here as the data lake. Similar to Amazon S3 or Azure Blob Storage.
BigQuery: Google's data warehouse. Similar to Amazon Redshift or Azure Synapse Analytics.
Apache Spark: Engine for running data engineering, data science, and machine learning workloads on single-node machines or clusters.
PySpark: Python API for Apache Spark.
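
As a rough illustration of the "SQL: Data Analysis & Exploration" item above, here is a minimal sketch of an exploratory aggregation. It uses Python's built-in sqlite3 module as a stand-in for a real warehouse so that it runs anywhere. The trips table, its columns, and the sample rows are hypothetical and are not taken from the course material.

```python
import sqlite3

# Hypothetical sample data; table and column names are illustrative only.
rows = [
    ("2024-01-01", 2.5, 12.0),
    ("2024-01-01", 7.1, 28.5),
    ("2024-01-02", 1.2, 8.0),
]


def daily_summary(trips):
    """Return (pickup_date, trip_count, avg_fare) per day using plain SQL."""
    conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
    try:
        conn.execute(
            "CREATE TABLE trips (pickup_date TEXT, distance REAL, fare REAL)"
        )
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", trips)
        query = """
            SELECT pickup_date,
                   COUNT(*) AS trip_count,
                   ROUND(AVG(fare), 2) AS avg_fare
            FROM trips
            GROUP BY pickup_date
            ORDER BY pickup_date
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for day, count, avg_fare in daily_summary(rows):
        print(day, count, avg_fare)
```

The same GROUP BY pattern should carry over to a warehouse such as BigQuery, although function names and data types can differ between SQL dialects.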
