A simple pipeline using:
- Kaggle Book Data
- AWS S3
- Google Books API
- Kafka
- Pandas
- Flask
- PostgreSQL
Pipeline steps:
- CSV data is loaded into S3
- CSV data is read from S3
- The pipeline is initiated via a GET request to a Flask endpoint (/loaddata)
- Example: a scheduled batch job issues a request to the endpoint to load more data into the PostgreSQL database
- Additional book data is pulled from the Google Books API using each record's ISBN
- The additional book data is merged with the CSV data from S3 using Pandas
- The merged data is sent to three topics in a running Kafka cluster
- The topics are read by a Kafka consumer
- Consumed messages are transformed to match the table schemas defined in our PostgreSQL database
- The transformed data is inserted into the appropriate tables in our PostgreSQL database
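The first two steps (reading the CSV from S3 when `/loaddata` is hit) can be sketched roughly as below. The bucket name, object key, and helper names are illustrative assumptions, not taken from the project:

```python
import csv
import io


def parse_books_csv(raw_csv):
    """Parse raw CSV text (as pulled from S3) into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def create_app(bucket="books-bucket", key="books.csv"):
    """Build the Flask app exposing the /loaddata endpoint.

    boto3 and Flask are imported lazily so the pure parsing helper
    above stays usable (and testable) without either installed.
    """
    import boto3
    from flask import Flask, jsonify

    app = Flask(__name__)
    s3 = boto3.client("s3")

    @app.route("/loaddata", methods=["GET"])
    def loaddata():
        # Fetch the CSV object from S3 and parse it into rows.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = parse_books_csv(body)
        # Downstream steps (enrichment, Kafka, PostgreSQL) would run here.
        return jsonify({"rows_loaded": len(rows)})

    return app
```

A scheduled batch job could then trigger it with a plain `curl http://host:5000/loaddata`.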
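The ISBN enrichment step against the Google Books API might look like the following sketch, using only the standard library. The network call is kept separate from the response parsing so the latter can be exercised offline; the choice of extracted fields is an assumption:

```python
import json
import urllib.parse
import urllib.request

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"


def books_api_url(isbn):
    """Build the Google Books volumes query URL for a single ISBN."""
    return GOOGLE_BOOKS_URL + "?" + urllib.parse.urlencode({"q": "isbn:" + isbn})


def extract_book_fields(payload):
    """Pull a few enrichment fields out of a Google Books API response."""
    items = payload.get("items") or []
    if not items:
        return {}  # the API found nothing for this ISBN
    info = items[0].get("volumeInfo", {})
    return {
        "title": info.get("title"),
        "authors": info.get("authors", []),
        "published_date": info.get("publishedDate"),
        "page_count": info.get("pageCount"),
    }


def fetch_book(isbn):
    """Fetch and parse enrichment data for one ISBN (network call)."""
    with urllib.request.urlopen(books_api_url(isbn)) as resp:
        return extract_book_fields(json.load(resp))
```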
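The merge of the API enrichment onto the S3 CSV rows is a natural fit for a Pandas left join keyed on ISBN; a minimal sketch (column names assumed):

```python
import pandas as pd


def merge_enrichment(csv_df, api_df):
    """Left-join API enrichment onto the S3 CSV rows by ISBN, keeping
    every CSV row even when the API returned nothing for that ISBN."""
    return csv_df.merge(api_df, on="isbn", how="left")


if __name__ == "__main__":
    csv_df = pd.DataFrame({"isbn": ["111", "222"], "title": ["A", "B"]})
    api_df = pd.DataFrame({"isbn": ["111"], "page_count": [350]})
    print(merge_enrichment(csv_df, api_df))
```

A left join (rather than inner) is the safer default here: rows the Google Books API could not enrich still flow through to Kafka and PostgreSQL with null enrichment columns.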
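Fanning the merged rows out to three Kafka topics could be sketched as below with the `kafka-python` client. The three topic names and the per-topic payload split are assumptions (the source does not name them):

```python
import json

# Hypothetical topic split -- these three names are assumptions,
# not taken from the project itself.
TOPICS = ("books", "authors", "ratings")


def split_record(record):
    """Split one merged row into the per-topic payloads."""
    return {
        "books": {"isbn": record["isbn"], "title": record.get("title")},
        "authors": {"isbn": record["isbn"], "authors": record.get("authors", [])},
        "ratings": {"isbn": record["isbn"], "rating": record.get("rating")},
    }


def publish(records, bootstrap="localhost:9092"):
    """Send each row's payloads to the three topics.

    kafka-python is imported lazily so split_record stays usable
    without a broker or the library installed.
    """
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for record in records:
        for topic, payload in split_record(record).items():
            producer.send(topic, payload)
    producer.flush()  # block until all buffered messages are delivered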
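The final consume-transform-insert leg might look like this, again with `kafka-python` plus `psycopg2`. The `books(isbn, title)` table schema, topic name, and connection DSN are all illustrative assumptions:

```python
import json


def to_books_row(message):
    """Transform one consumed 'books' message into the column order of a
    hypothetical books(isbn, title) table."""
    return (message["isbn"], message.get("title"))


INSERT_BOOKS = (
    "INSERT INTO books (isbn, title) VALUES (%s, %s) "
    "ON CONFLICT (isbn) DO NOTHING"
)


def consume_into_postgres(dsn="dbname=books user=postgres", bootstrap="localhost:9092"):
    """Consume the 'books' topic and insert each message as a row.

    kafka-python and psycopg2 are imported lazily so the pure helpers
    above stay testable without a broker or database running.
    """
    from kafka import KafkaConsumer
    import psycopg2

    consumer = KafkaConsumer(
        "books",
        bootstrap_servers=bootstrap,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    for msg in consumer:  # blocks, polling the topic for new messages
        cur.execute(INSERT_BOOKS, to_books_row(msg.value))
        conn.commit()  # commit per message -- simple, not the fastest
```

The `ON CONFLICT ... DO NOTHING` clause makes the insert idempotent, which matters because Kafka's default delivery semantics can replay messages after a consumer restart.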
Purpose:
- To apply what I learned from online data engineering resources