Started: 17 January 2022
- First steps with Docker, creating images and containers ready for work.
- Introduction to Terraform and Infrastructure as Code.
- Implementing PostgreSQL in a container and ingesting data using Python/Pandas.
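The ingestion step can be sketched roughly like this (the file, table, and connection names below are placeholders, not the actual ones used):

```python
# Sketch: stream a large CSV into a SQL table in chunks, so the whole
# file never has to fit in memory. File/table names are hypothetical.
import pandas as pd

def ingest_csv(conn, csv_path, table, chunksize=100_000):
    """Append each chunk of the CSV to `table`; returns rows ingested."""
    total = 0
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        chunk.to_sql(table, conn, if_exists="append", index=False)
        total += len(chunk)
    return total

# Against the PostgreSQL container this would take a SQLAlchemy engine, e.g.:
# engine = create_engine("postgresql://user:pass@localhost:5432/mydb")
# ingest_csv(engine, "data.csv", "my_table")
```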
All homework completed.
More details can be found in this folder
Started: 24 January 2022
- Introduction to Workflow Orchestration and why to use it.
- Setting up an Airflow environment with Docker.
- First data transfer job in Google Cloud using the "Data Transfer" tool.
- Transferring local data to PostgreSQL using an Airflow DAG.
- Transferring data from S3 to Google Cloud using Airflow, converting it from CSV to Parquet along the way.
All homework completed.
More details can be found in this folder
Started: 31 January 2022
- Data warehouse (BigQuery)
- What is a data warehouse solution
- What is BigQuery, why it is so fast, cost of BQ
- Partitioning and clustering, automatic re-clustering
- Pointing to a location in Google Cloud Storage
- Loading data into BigQuery & PostgreSQL using an Airflow operator
- BQ best practices
- Misc: BQ geolocation, BQ ML
- Alternatives (Snowflake/Redshift)
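Partitioning and clustering are declared when the table is created; a hedged sketch of the BigQuery DDL (the dataset, table, and column names are invented for illustration):

```sql
-- Hypothetical dataset/table: partition by a date column, cluster by an id.
CREATE TABLE my_dataset.trips_partitioned
PARTITION BY DATE(pickup_datetime)
CLUSTER BY vendor_id AS
SELECT * FROM my_dataset.trips_staging;
```

Queries that filter on the partition column then scan only the matching partitions, which is where most of the cost savings come from.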
All homework completed.
More details can be found in this folder
Started: 7 February 2022
- Basics of analytics engineering
- Developing a dbt project (a combination of coding and theory)
- Visualising the data in Google Data Studio
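At its core a dbt model is just a SELECT statement in a .sql file; a minimal sketch (the model and source names here are made up):

```sql
-- models/staging/stg_trips.sql (hypothetical model)
-- dbt compiles source()/ref() into fully-qualified table names.
select
    vendor_id,
    cast(pickup_datetime as timestamp) as pickup_datetime,
    total_amount
from {{ source('staging', 'trips') }}
```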
All homework completed.
More details can be found in this folder
Started: 16 February 2022
- Batch processing
- What is Spark
- Spark DataFrames
- Spark SQL
- Internals: GroupBy and joins
- Resilient Distributed Dataset (RDD)
All homework completed.
More details can be found in this folder
Started: 16 February 2022
- Basics
- What is Kafka
- Internals of Kafka, broker
- Partitioning of a Kafka topic
- Replication of a Kafka topic
- Consumer-producer
- Schemas (Avro)
- Streaming
- Kafka streams
- Kafka connect
- Alternatives (PubSub/Pulsar)
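Key-based partitioning, which decides which topic partition a message lands on, can be illustrated in miniature. Note this is only a sketch: Kafka's default partitioner hashes keys with murmur2, and CRC32 is used here as a stand-in.

```python
# Miniature illustration of key-based topic partitioning: every message
# with the same key maps to the same partition, which is what preserves
# per-key ordering. (Kafka's default partitioner uses murmur2 hashing;
# CRC32 is a stand-in for this sketch.)
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions
```

Because the mapping is deterministic, all events for one key (say, one rider or one vehicle) are consumed in order from a single partition.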
All homework completed.
More details can be found in this folder
Final project is available here