big-data scraping mongodb elasticsearch data-engineering kafka kafka-connect debezium django-rest-framework redis airflow minio s3 python data-pipeline hacktoberfest

Data Engineering Project

Data Engineering Project is an implementation of the data pipeline which consumes the latest news from RSS Feeds and makes them available for users via handy API. The pipeline infrastructure is built using popular, open-source projects.

Access the latest news and headlines in one place. 💪

Architecture diagram

How it works

Data Scraping

Airflow DAG is responsible for the execution of Python scraping modules. It runs periodically every X minutes producing micro-batches.

First task updates proxypool. Using proxies in combination with rotating user agents can help get scrapers past most of the anti-scraping measures and prevent being detected as a scraper.
Second task extracts news from RSS feeds provided in the configuration file, validates the quality and sends data into Kafka topic A. The extraction process is using validated proxies from proxypool.

Data flow

Kafka Connect Mongo Sink consumes data from Kafka topic A and stores news in MongoDB using upsert functionality based on _id field.
Debezium MongoDB Source tracks a MongoDB replica set for document changes in databases and collections, recording those changes as events in Kafka topic B.
Kafka Connect Elasticsearch Sink consumes data from Kafka topic B and upserts news in Elasticsearch. Data replicated between topics A and B ensures MongoDB and ElasticSearch synchronization. Command Query Responsibility Segregation (CQRS) pattern allows the use of separate models for updating and reading information.
Kafka Connect S3-Minio Sink consumes records from Kafka topic B and stores them in MinIO (high-performance object storage) to ensure data persistency.

Data access

Data gathered by previous steps can be easily accessed in API service using public endpoints.

Prerequisites

Software required to run the project. Install:

Docker - You must allocate a minimum of 8 GB of Docker memory resource.
Python 3.8+ (pip)
docker-compose

Running project

Script manage.sh - wrapper for docker-compose works as a managing tool.

Build project infrastructure

./manage.sh up

Stop project infrastructure

./manage.sh stop

Delete project infrastructure

./manage.sh down

Testing

Script run_tests.sh executes unit tests against Airflow scraping modules and Django Rest Framework applications.

./run_tests.sh

API service

Read detailed documentation on how to interact with data collected by pipeline using search endpoints.

Example searches:

see all news

http://127.0.0.1:5000/api/v1/news/

add search_fields title and description, see all of the news containing the Robert Lewandowski phrase

http://127.0.0.1:5000/api/v1/news/?search=Robert%20Lewandowski

find news containing the Lewandowski phrase in their titles

http://127.0.0.1:5000/api/v1/news/?search=title|Lewandowski

see all of the polish news containing the Lewandowski phrase

http://127.0.0.1:5000/api/v1/news/?search=lewandowski&language=pl

References

Inspired by following codes, articles and videos:

Contributions

Contributions are what makes the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Please feel free to contact me if you have any questions. Damian Kliś @DamianKlis

About

Example end to end data engineering project.