NickolasB98 / aws-severless-project


A serverless AWS project fetching weather data from an API, utilizing these AWS services: Lambda, Kinesis Firehose, S3, Glue Crawler, Glue ETL workflow orchestration, EventBridge as a Lambda trigger, and CloudWatch Logs for monitoring the Lambda functions and ETL job scripts. The processed data is then visualized using Grafana connected to Athena for interactive exploration.

Project Architecture

This project leverages a serverless architecture on AWS to build a data pipeline for weather data.

Here's a breakdown of the key components and their roles:

Data Source:

[image: project architecture diagram]

The data originates from the [Open-Meteo](https://open-meteo.com/) API ([documentation](https://open-meteo.com/en/docs)). This API provides access to historical weather data for Groningen, NL, the city where I obtained my MSc degree. Open-Meteo exposes both historical and real-time weather data for locations around the world through its APIs.
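The README does not show the exact request the Lambdas make. As a rough sketch, assuming Open-Meteo's historical archive endpoint, illustrative parameter names, and approximate Groningen coordinates, building the request URL could look like this:

```python
from urllib.parse import urlencode

# Approximate coordinates for Groningen, NL (illustrative assumption).
GRONINGEN_LAT = 53.22
GRONINGEN_LON = 6.57

def build_archive_url(start_date: str, end_date: str,
                      hourly: str = "temperature_2m") -> str:
    """Build a request URL for Open-Meteo's historical-weather API."""
    params = {
        "latitude": GRONINGEN_LAT,
        "longitude": GRONINGEN_LON,
        "start_date": start_date,
        "end_date": end_date,
        "hourly": hourly,
    }
    return "https://archive-api.open-meteo.com/v1/archive?" + urlencode(params)
```

Fetching the URL (e.g. with `urllib.request` or `requests`) returns the hourly observations as JSON.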

AWS Lambda Functions:

Batch Data Lambda: This serverless function is triggered when a new batch of weather data arrives from the API. It pre-processes and prepares the data before sending it to Kinesis Firehose for streaming.

[image: Batch Data Lambda]

Continuous Data Lambda: This function is triggered by AWS EventBridge at fixed time intervals. It handles the continuous stream of weather data, performing real-time processing before sending it to the Firehose.

[image: Continuous Data Lambda]
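A minimal sketch of the Lambda-to-Firehose handoff described above. The stream name, record shape, and fetch logic are hypothetical placeholders, not the project's actual code:

```python
import json

# Hypothetical stream name -- the project's actual configuration
# is not shown in this README.
FIREHOSE_STREAM = "weather-data-stream"

def fetch_weather_batch() -> list[dict]:
    """Placeholder for the Open-Meteo API call made by the Lambda."""
    return []

def to_firehose_records(observations: list[dict]) -> list[dict]:
    """Serialize observations into Firehose record dicts.

    Each record's Data must be bytes; the trailing newline keeps the
    JSON objects line-delimited once Firehose concatenates them in S3.
    """
    return [{"Data": (json.dumps(obs) + "\n").encode("utf-8")}
            for obs in observations]

def lambda_handler(event, context):
    import boto3  # available by default in the Lambda runtime
    firehose = boto3.client("firehose")
    records = to_firehose_records(fetch_weather_batch())
    # PutRecordBatch accepts at most 500 records per call.
    for i in range(0, len(records), 500):
        firehose.put_record_batch(
            DeliveryStreamName=FIREHOSE_STREAM,
            Records=records[i:i + 500],
        )
    return {"records_sent": len(records)}
```

Batching the `PutRecordBatch` calls keeps the function within the Firehose API limit regardless of how many observations one invocation fetches.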

Amazon Kinesis Firehose:

[image: Kinesis Firehose delivery stream]

Based on the invoked Lambda function, the Firehose can handle data in two ways:

Batch Data: If triggered by the Batch Data Lambda, the Firehose streams the prepared data in batches to its S3 destination.

Continuous Data: When triggered by the Continuous Data Lambda (at timed intervals), the Firehose continuously streams the real-time data to S3.

AWS provides automated Firehose stream metrics, such as incoming bytes, put requests, record counts, and throttled records:

[image: Firehose stream metrics]

Amazon S3:

Amazon S3 serves as the storage layer for this project, housing the weather data at various stages of the pipeline. Here's a breakdown of the different S3 buckets and their purposes:

Weather Data Buckets:

weather-(batch)data-bucket: These buckets store the raw, unprocessed weather data received from the Open-Meteo API. The data format might be JSON, CSV, or the original format provided by the API. These buckets might be named with timestamps or identifiers indicating the time period of the data (e.g., May 2024).

Processed Data Buckets:

open-meteo-weather-batch-data-parquet-bucket/: These buckets contain historical weather data that has been processed and converted into the Parquet format by ETL job scripts. Parquet is columnar and optimized for efficient querying with Athena, making it ideal for later analysis.

parquet-weather-table-prod/: These buckets hold the final, transformed weather data in Parquet format, the end product of the ETL workflow. This is the data readily available for querying and analysis with Athena, and potentially for visualization with Grafana.

Temporary Buckets:

Buckets named like `aws-athena-query-results-**` / `store-query-results-for-athena-**` are used to temporarily store the results of Athena queries. Depending on the configuration, these buckets could be automatically cleaned up after a set period.

Firehose Partitioning into S3:

The Kinesis Firehose automatically partitions the data as it delivers it to S3 buckets. This partitioning helps Glue Crawler efficiently discover the schema of the data. Each partition represents a specific time period or data segment, making it easier to query and analyze specific weather data ranges using Athena later.
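By default, Firehose prefixes delivered S3 objects with a UTC timestamp path of the form `YYYY/MM/dd/HH/`, which is what makes time-ranged queries cheap later. A small sketch of that layout (the README does not show whether a custom prefix is configured):

```python
from datetime import datetime, timezone

def firehose_default_prefix(ts: datetime) -> str:
    """Reproduce Firehose's default S3 key prefix: YYYY/MM/dd/HH/ (UTC)."""
    return ts.astimezone(timezone.utc).strftime("%Y/%m/%d/%H/")
```

An object delivered at 13:30 UTC on 1 May 2024 would land under the `2024/05/01/13/` prefix of the destination bucket.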

By utilizing different S3 buckets for various data stages, the project maintains a clear separation between raw, processed, and final weather data. This organization simplifies data retrieval for analysis and ensures efficient querying with Athena.

This is a snapshot of all my S3 buckets, covering both the historical batch weather data and the continuous incoming weather data triggered by EventBridge. Both streams are delivered through the same Firehose.

[image: S3 buckets]

AWS Glue:

Glue Crawler:

This automatically discovers and defines the schema of the weather data stored in S3.

[image: Glue Crawler]

Glue ETL Workflow Orchestration:

We utilize Glue's capabilities to define and orchestrate the data transformation logic. A series of Glue jobs perform data transformations, data quality checks, and ultimately save the processed data to a new table stored as Parquet files.

[image: Glue ETL workflow]

Amazon Athena:

This serverless interactive query service allows us to analyze the transformed weather data using standard SQL queries.

[images: Athena queries]
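A hedged example of the kind of SQL Athena (or Grafana, via its Athena data source) could run over the Parquet table. The table and column names (`time`, `temperature_2m`) are assumptions; the real schema is whatever the Glue Crawler discovered:

```python
def daily_avg_temperature_sql(table: str, start: str, end: str) -> str:
    """Build an example Athena query over the processed weather table.

    Column names are illustrative assumptions, not the project's schema.
    """
    return (
        f"SELECT date_trunc('day', from_iso8601_timestamp(time)) AS day, "
        f"avg(temperature_2m) AS avg_temp "
        f"FROM {table} "
        f"WHERE time BETWEEN '{start}' AND '{end}' "
        f"GROUP BY 1 ORDER BY 1"
    )

# Running it would look roughly like (requires AWS credentials):
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=daily_avg_temperature_sql("parquet_weather_table_prod",
#                                           "2024-05-01", "2024-05-31"),
#     ResultConfiguration={"OutputLocation": "s3://<athena-results-bucket>/"},
# )
```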

Grafana:

Grafana, a visualization tool, connects to Athena, enabling the creation of interactive dashboards to explore the weather data insights. You can leverage standard SQL queries within Grafana to visualize the processed data.

While static images of the dashboard are included below, the power of Grafana lies in its interactivity. To explore the dashboard functionalities directly, you can access a linked snapshot:

https://nickolasb98.grafana.net/dashboard/snapshot/1uXP8OvJex8ybSvKMYCoRaydwpI9eooa

Static snapshots as PDF files, for a quick overview:

[images: Grafana dashboard snapshots]

Pipeline Functionality:

Data Ingestion: A Lambda function is triggered periodically (or based on an event) to fetch weather data from the chosen API.

Data Streaming: The retrieved data is sent to Amazon Kinesis Firehose for continuous streaming.

Data Storage: The Firehose delivers the data to an S3 bucket for raw data storage.

Schema Discovery: A Glue Crawler automatically discovers and defines the schema of the data stored in S3.

Data Transformation: Glue ETL jobs cleanse and transform the data as needed, and perform data quality checks to ensure data integrity.

Data Storage (Processed): The transformed data is saved to a new table in S3 using the Parquet format, optimized for analytics.

Data Analysis: Amazon Athena, when connected to Grafana Cloud, allows querying the processed weather data using standard SQL for further analysis and visualization.

Monitoring: This project utilizes AWS CloudWatch Logs for centralized monitoring of the data pipeline components. CloudWatch Logs capture details about Lambda function execution (invocation time, duration, and any errors encountered during data ingestion) and Glue ETL job execution (start and end times, completed job steps, and any errors during data transformation). By analyzing CloudWatch Logs, you can identify potential issues, monitor performance, and ensure the smooth operation of the pipeline.
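The data-quality checks mentioned in the transformation step could be sketched like this. The field names and plausible-value ranges are illustrative assumptions, not the project's actual Glue job rules:

```python
def quality_check(record: dict) -> list[str]:
    """Flag common data-quality problems in a single weather record.

    Field names and ranges are illustrative assumptions.
    """
    issues = []
    if not record.get("time"):
        issues.append("missing timestamp")
    temp = record.get("temperature_2m")
    if temp is None:
        issues.append("missing temperature")
    elif not -60.0 <= temp <= 60.0:
        issues.append("temperature out of plausible range")
    return issues
```

In a Glue job, records failing such checks would typically be dropped or routed to a quarantine location before the clean rows are written out as Parquet.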

Benefits:

Scalability and Cost-Efficiency: Serverless architecture scales automatically based on data volume and minimizes infrastructure management costs.

Flexibility: The pipeline can be easily adapted to handle different weather APIs or data sources.

Automation: Data ingestion, transformation, and storage are automated, reducing manual intervention.

Analytics Ready: The processed data in Parquet format is optimized for efficient querying with Athena.

Languages

Language: Python 100.0%