marinavillaschi / report-generator

ETL process to extract data from API and make it available for querying on AWS Athena

Report Generator Project

Generating reports on a schedule is a need in many industries. It's a repetitive task that can be time-consuming and prone to human error.

This is the problem I'm solving with this project.

Automating report generation saves employees time and ensures that every report is produced the same tested way, which avoids human error and keeps data and report quality consistent. Employees can then focus their time and attention on more important matters.

Project Description

This project has two main components:

1. Cron-based data fetcher Lambda service

AWS Lambda service that runs on a schedule to fetch data from an external API and upload it to an S3 bucket.

2. Event-based report generator Lambda service

AWS Lambda service that runs whenever new data lands in the S3 bucket. It triggers a Glue crawler so that the data becomes available to query from AWS Athena.

In a nutshell:

We have data from the CoinGecko API coming in daily to an AWS S3 bucket in CSV format.

S3_bucket_snapshot
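To make the fetch step more concrete, here is a minimal sketch of what a data fetcher handler could look like. It is not the project's actual code: the CoinGecko endpoint, the bucket name, the `raw/` key layout, and the helper names `rows_to_csv` and `build_key` are all assumptions made for illustration.

```python
import csv
import io
import json
import urllib.request
from datetime import date

# Assumptions for illustration only: the real endpoint, bucket and key layout may differ.
API_URL = "https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd"
BUCKET = "my-report-bucket"  # hypothetical bucket name


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_key(day):
    """Build one object key per day (hypothetical layout)."""
    return f"raw/{day.isoformat()}.csv"


def lambda_handler(event, context):
    # boto3 is imported lazily so the helpers above can be tested without the AWS SDK.
    import boto3

    # Fetch the JSON payload from the API.
    with urllib.request.urlopen(API_URL, timeout=30) as response:
        rows = json.load(response)

    # Upload the CSV to S3 under a dated key.
    key = build_key(date.today())
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=key, Body=rows_to_csv(rows).encode("utf-8")
    )
    return {"bucket": BUCKET, "key": key, "rows": len(rows)}
```

Keeping the CSV conversion and key naming in small pure functions makes them easy to check locally before deploying the Lambda.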

As soon as this data comes into the bucket, it triggers an AWS Glue Crawler.

glue_crawler_snapshot

This crawler scans the data and creates or updates a Glue database and table.

glue_database_snapshot

glue_table_snapshot
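The trigger step described above could be sketched roughly as follows. This is an illustration under assumptions, not the project's actual handler: the crawler name and the helper `extract_objects` are hypothetical, and the event is assumed to be a standard S3 event notification.

```python
from urllib.parse import unquote_plus

CRAWLER_NAME = "report-generator-crawler"  # hypothetical crawler name


def extract_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    return [
        # Object keys arrive URL-encoded in S3 events, so decode them.
        (record["s3"]["bucket"]["name"], unquote_plus(record["s3"]["object"]["key"]))
        for record in event.get("Records", [])
    ]


def lambda_handler(event, context):
    # boto3 is imported lazily so extract_objects can be tested without the AWS SDK.
    import boto3

    objects = extract_objects(event)
    if not objects:
        return {"started": False, "objects": []}

    glue = boto3.client("glue")
    try:
        glue.start_crawler(Name=CRAWLER_NAME)
        started = True
    except glue.exceptions.CrawlerRunningException:
        # A crawl is already in progress, so there is nothing to start.
        started = False
    return {"started": started, "objects": objects}
```

Catching `CrawlerRunningException` is a design choice for the case where several files land close together: the handler then reports that no new crawl was started instead of failing.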

Once this data is catalogued by AWS Glue, it can be queried from AWS Athena.

athena_snapshot
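Besides the Athena console, the catalogued table can also be queried from Python. The sketch below assumes a boto3 Athena client is passed in; the function name `run_query` and the database, table, and output location in the usage note are hypothetical.

```python
import time


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run a query on Athena and return the first page of results as lists of strings."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a final state is reached.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {query_id} ended as {status['State']}: {reason}")

    # Only the first page is read here; for SELECT queries the first row is the header.
    result = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [column.get("VarCharValue") for column in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

It could be called as `run_query(boto3.client("athena"), "SELECT * FROM my_table LIMIT 10", "my_database", "s3://my-athena-results/")`, where all three names are placeholders for the real table, Glue database, and results bucket.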

TODO:

  • Create a Glue crawler in the template that runs every time new data lands in S3 and creates/updates the Glue database OK!

  • Set up Athena to read the data from S3 using the database created by the crawler OK!

  • Create a dashboard that reads the data through Athena

How to run it

This project was made using the AWS SAM CLI.

To reproduce it, you need to:

  • Create a Python 3.9 virtual environment (Python 3.9 is required):

    py -3.9 -m venv venv

  • Validate the template.yaml file:

    sam validate or sam validate --lint

  • Build the application:

    sam build

  • Deploy the application:

    sam deploy or sam deploy --guided (to be prompted for the env vars)

You can find more details on SAM CLI commands in the AWS SAM CLI command reference.

Author

Marina Villaschi

Acknowledgements

CoinGecko API for the data provided.

About

License: MIT License


Languages

Language: Python 100.0%