guoliveira / data-engineer-zoomcamp-project

Average Temperature in Portugal in the last 20 years 😎

Data Engineer Zoomcamp Capstone Project

This capstone project was developed within the scope of the Data Engineer Zoomcamp by DataTalksClub (DTC), the biggest data community on the internet.

The zoomcamp covered the following main topics/tools:

  • Docker and docker-compose;
  • Google Cloud Platform;
  • Terraform;
  • Airflow;
  • Data Warehouse with BigQuery;
  • Analytics Engineering with Data Build Tool (DBT);
  • Batch with Spark;
  • Streaming with Kafka.

The zoomcamp is completed with a personal project involving some of those tools/topics.

For my project I decided to analyse the historical temperature in Portugal over the last 20 years. More specifically, I decided to analyse the average temperature from 2000 to 2020 (2021 was avoided due to possible errors in the data, and 2022 because it is incomplete).

With this project I intend to analyse the average temperature in Portugal over the last 20 years.

In terms of datasets I had many choices, but I decided to use the NOAA dataset available on AWS Open Data.

This dataset has the following description:

Global Historical Climatology Network - Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurement only. Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. It is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews. Some data are more than 175 years old. The data is in CSV format. Each file corresponds to a year from 1763 to present and is named as such.
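
As an illustration of how one of these yearly files can be reduced to Portuguese temperature records, a minimal pandas sketch is shown below. The S3 path, the handling of a possible header row and the "PO" station prefix for Portugal are assumptions made for this example, not code taken from the project.

```python
import pandas as pd

# Documented GHCN-Daily CSV layout: one observation per row.
GHCN_COLUMNS = ["id", "date", "element", "value",
                "m_flag", "q_flag", "s_flag", "obs_time"]

def load_portugal_temperatures(year: int) -> pd.DataFrame:
    """Read one GHCN-Daily yearly file and keep Portuguese temperature records."""
    # Assumed public path of the NOAA GHCN-Daily dataset on AWS Open Data.
    url = f"https://noaa-ghcn-pds.s3.amazonaws.com/csv/by_year/{year}.csv"
    df = pd.read_csv(url, names=GHCN_COLUMNS, header=None, dtype=str)
    df = df[df["id"] != "ID"]  # drop a header row if the file includes one

    # Station ids start with a country code; "PO" is assumed to be Portugal.
    mask = df["id"].str.startswith("PO") & df["element"].isin(["TAVG", "TMAX", "TMIN"])
    pt = df[mask].copy()

    # Temperature values are stored in tenths of a degree Celsius.
    pt["value"] = pd.to_numeric(pt["value"]) / 10.0
    pt["date"] = pd.to_datetime(pt["date"], format="%Y%m%d")
    return pt

# Example usage: the yearly files are large, so this may take a while to download.
print(load_portugal_temperatures(2010).head())
```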

Used Technologies 🔨

For this project I decided to use the following tools:

  • Docker - for the containerization of the other technologies;
  • Airflow - for the orchestration of the full pipeline;
  • Terraform - as an Infrastructure-as-Code (IaC) tool;
  • Google Cloud Storage (GCS) - for storage as the Data Lake;
  • BigQuery - for the project's Data Warehouse;
  • Spark - for the transformation of raw data into refined data (a sketch is shown after this list);
  • Google Data Studio - for visualizations.
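
As a rough sketch of how the Spark step could turn a raw yearly GHCN-Daily file into refined daily averages for Portugal (the bucket paths, the "PO" prefix and the output layout are assumptions, not the exact job used in this repo):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ghcn-portugal-refine").getOrCreate()

# Raw yearly GHCN-Daily file previously landed in the data lake (path is an assumption).
raw = (
    spark.read.option("header", "false")
    .csv("gs://capstone-data-lake/raw/2010.csv")
    .toDF("id", "date", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time")
)

# Keep Portuguese stations ("PO" prefix assumed) and average-temperature records,
# convert tenths of a degree Celsius to degrees, and aggregate per station and day.
refined = (
    raw.filter(F.col("id").startswith("PO") & (F.col("element") == "TAVG"))
    .withColumn("temp_c", F.col("value").cast("double") / 10.0)
    .withColumn("obs_date", F.to_date(F.col("date"), "yyyyMMdd"))
    .groupBy("id", "obs_date")
    .agg(F.avg("temp_c").alias("avg_temp_c"))
)

# Write the refined data back to the lake as Parquet (path is an assumption).
refined.write.mode("overwrite").parquet("gs://capstone-data-lake/refined/2010/")
```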

Development Steps 🚧

This capstone followed these general development steps:

  1. Started a Google Cloud Platform free account (this step was skipped since the account had already been created for the zoomcamp);

  2. Creation of a GCP project with the name "Capstone-Luis-Oliveira", following the advanced steps described here;

  3. Creation of the GCP infrastructure using Terraform. This infrastructure includes BigQuery and Storage. The steps can be seen here;

  4. Development of the Dockerfile and docker-compose structure to run Airflow;

  5. Running Airflow inside a container and development of two DAGs for the data pipeline (a minimal DAG sketch is shown after these steps);

  6. Ran the two DAGs in order to get the raw and the refined (transformed) files into GCP Storage. The data pipeline is presented here;

  7. Creation of two tables in BigQuery using DDL: one table with Portuguese weather station information (code, latitude, longitude, region and location) and one partitioned table with the average temperature by day and by station (this table was partitioned by year). The DDL queries are presented here.

It was decided to run the DDL queries directly in BigQuery because they only had to be executed once (it would have been unnecessary effort to set them up in Airflow). A sketch of such a DDL statement is shown below.
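
For illustration, a yearly-partitioned table of that shape could be created with a DDL statement like the one below, run either in the BigQuery console or through the Python client. The project, dataset, table and column names are hypothetical; the real queries are linked above. The view used for Data Studio could be created the same way with a CREATE VIEW statement.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the default GCP project/credentials

# Hypothetical project, dataset, table and column names for illustration.
ddl = """
CREATE TABLE IF NOT EXISTS `capstone-project.weather.avg_temperature_by_station`
(
  station_id STRING,
  obs_date   DATE,
  avg_temp_c FLOAT64
)
PARTITION BY DATE_TRUNC(obs_date, YEAR);
"""

client.query(ddl).result()  # run once; the table is partitioned by year
```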

  8. Development of a visualization using Data Studio. To obtain a better visualization, this view was created to be used by Data Studio. The developed charts are presented in this link.
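
As referenced in step 5, the sketch below shows roughly what one of the ingestion DAGs could look like. The DAG id, the yearly schedule and the task callables are assumptions for illustration; the real DAGs are in the repo.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_year(year, **_):
    """Placeholder: fetch the raw GHCN-Daily CSV for one year."""


def upload_to_gcs(year, **_):
    """Placeholder: push the raw file to the GCS data lake bucket."""


with DAG(
    dag_id="ghcn_raw_ingestion",      # hypothetical DAG id
    start_date=datetime(2000, 1, 1),
    end_date=datetime(2020, 12, 31),
    schedule_interval="@yearly",      # one run per year of data
    catchup=True,                     # backfill 2000-2020
) as dag:
    download = PythonOperator(
        task_id="download_year",
        python_callable=download_year,
        op_kwargs={"year": "{{ execution_date.year }}"},
    )
    upload = PythonOperator(
        task_id="upload_to_gcs",
        python_callable=upload_to_gcs,
        op_kwargs={"year": "{{ execution_date.year }}"},
    )

    download >> upload
```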

Thank you for your attention.

😉
