- Java JDK 11
- Python 3.8+
- Docker and Docker Compose
- Install the Python requirements:
pip install -r requirements.txt
- Allow direnv to load the provided .envrc:
direnv allow
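As a quick sanity check of the prerequisites listed above, a minimal sketch along these lines can report which tools are missing from `PATH`. The executable names in `REQUIRED_TOOLS` are assumptions about how the tools are installed, the helper names are hypothetical, and the script does not verify that the installed JDK is version 11.

```python
import shutil
import sys

# Executables assumed to correspond to the prerequisites above; adjust the
# names if your installation exposes them differently.
REQUIRED_TOOLS = ["java", "docker", "direnv"]
MIN_PYTHON = (3, 8)


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum
```

Calling `missing_tools(REQUIRED_TOOLS)` returns an empty list when everything is found, and `python_is_supported()` checks the interpreter that runs the script.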
This project relies on several services. To simplify testing and configuration, they are run from a set of Docker Compose files:
- MinIO: an S3-compatible storage service (used as the Hadoop DFS for Spark jobs, and as a sink for some Kafka connectors)
- Hive: a metastore used by Spark to store and manage the metadata of persistent relational entities (databases, tables, ...). It uses a Postgres database as its backend storage.
- Kafka stack, with the following services:
  - ZooKeeper: tracks the status of the nodes in the Kafka cluster and maintains cluster metadata such as the list of topics (the messages themselves are stored by the broker)
  - broker: a Kafka node that stores the message log and handles client requests (produce, consume, ...)
  - schema registry: a service that stores Avro schemas so they can be used later to deserialize messages
  - control center: a UI used to manage the Kafka cluster
  - create reddit topics: a simple container that initializes the cluster by creating the topics the app needs
  - reddit connector: a Kafka connector running in standalone mode that streams Reddit posts and comments
invoke compose.up
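Once the containers are started, it can help to confirm that each service accepts connections before running anything against it. The sketch below polls TCP ports from the host. The port numbers are assumptions based on each service's usual default, not values taken from this project's compose files, and the function names are hypothetical.

```python
import socket
import time

# Assumed host ports (common defaults); the real values are whatever the
# project's docker compose files publish.
ASSUMED_SERVICES = {
    "minio": 9000,
    "hive-metastore": 9083,
    "zookeeper": 2181,
    "broker": 9092,
    "schema-registry": 8081,
    "control-center": 9021,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(services, host="localhost", retries=30, delay=2.0):
    """Poll each port until it accepts connections.

    Returns the sorted names of the services that are still unreachable
    after `retries` rounds (an empty list means everything is up).
    """
    pending = dict(services)
    for _ in range(retries):
        pending = {
            name: port
            for name, port in pending.items()
            if not is_port_open(host, port)
        }
        if not pending:
            break
        time.sleep(delay)
    return sorted(pending)
```

For example, `wait_for_services(ASSUMED_SERVICES)` returns the names of any services that never became reachable. An open port only shows that the container is listening, not that the service has finished initializing.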
A Kafka connector that calls the Reddit API to read posts and comments for a list of subreddits, then writes them to two Kafka topics. Here is the documentation of this service.
A Java project containing three modules:
- kafka-producers: a module that produces fake Kafka messages to test the different services
- kstreams-apps: a module containing applications based on Confluent Kafka Streams
- spark-streaming-apps: a module containing applications based on Spark Structured Streaming