This repo contains my homework, notes, and final project(s) for the Data Engineering Zoomcamp by DataTalks.Club.
Each week I watched a series of videos and then completed the homework exercises.
We used a range of tools:
- Terraform: Infrastructure-as-Code (IaC)
- Docker: Containerization
- SQL: Data Analysis & Exploration
- Mage: Workflow Orchestration. Airflow is an alternative.
- dbt (data build tool): Open-source command-line tool that lets data analysts and engineers transform and model data in their data warehouses using SQL.
- Metabase: Open-source business intelligence (BI) and analytics tool for visualizing and analyzing data. Google Looker Studio is an alternative.
- Google Dataproc: Managed service for running Apache Hadoop, Apache Spark, Apache Hive, Apache Pig, and other big data processing frameworks. Similar to Amazon EMR or Azure HDInsight.
- Google Cloud Storage: Google's object storage, used here as the data lake. Similar to Amazon S3 or Azure Blob Storage.
- BigQuery: Google's data warehouse. Similar to Amazon Redshift or Azure Synapse Analytics.
- Apache Spark: Engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters.
- PySpark: Python API for Apache Spark.
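
To give a feel for the "SQL: Data Analysis & Exploration" item in the list above, here is a minimal sketch of an exploratory aggregation. It is not taken from the course material: it assumes Python's built-in `sqlite3` module as a local stand-in for a real warehouse such as BigQuery, and the `trips` table, its columns, the sample rows, and the `daily_summary` helper are all hypothetical.

```python
import sqlite3

# Hypothetical sample rows, made up for illustration (not course data):
# (pickup_date, passenger_count, trip_distance, total_amount)
TRIPS = [
    ("2021-01-01", 1, 2.5, 12.0),
    ("2021-01-01", 2, 1.0, 8.5),
    ("2021-01-02", 1, 5.0, 20.0),
]


def daily_summary(rows):
    """Load rows into an in-memory SQLite table and aggregate them per day."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE trips ("
            "pickup_date TEXT, passenger_count INTEGER, "
            "trip_distance REAL, total_amount REAL)"
        )
        # Parameterized insert keeps the data separate from the SQL text.
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?)", rows)
        query = """
            SELECT pickup_date,
                   COUNT(*) AS trips,
                   ROUND(SUM(total_amount), 2) AS revenue
            FROM trips
            GROUP BY pickup_date
            ORDER BY pickup_date
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in daily_summary(TRIPS):
        print(row)
```

The same `GROUP BY` query shape should carry over to a warehouse with little change, although function names and types can differ between SQL dialects.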