hudi kafka kafka-connect kafka-streams spark spark-streaming

Stream processing demo

Requirements

Java JDK 11
Python 3.8+
Docker and Docker Compose
Install the Python requirements:

pip install -r requirements.txt

Allow direnv to load the provided .envrc:

direnv allow

Docker compose stack

Description

This project relies on several services. To simplify testing and configuring them, a set of Docker Compose files is used to run them:

MinIO: an S3-compatible service used for storage (as the Hadoop file system for Spark jobs, and as a sink for some Kafka connectors)
Hive: a metastore used by Spark to store and manage the metadata of persistent relational entities (databases, tables, ...). It uses a Postgres database as its backend storage.
Kafka stack with the following services:
- Zookeeper: tracks the status of the nodes in the Kafka cluster and maintains cluster metadata such as the list of Kafka topics
- broker: a Kafka node that stores the message log and handles client requests (produce, consume, ...)
- schema registry: a service that stores Avro schemas so they can be used later to deserialize the messages
- control center: a UI used to manage the Kafka cluster
- create reddit topics: a simple container that initializes the cluster by creating the topics the app needs
- reddit connector: a Kafka connector running in standalone mode to stream Reddit posts and comments
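
To illustrate how a Spark job might be pointed at this stack, the sketch below builds the configuration entries for reaching MinIO through the Hadoop S3A connector and for using Hive as the metastore. The endpoint URLs, the credentials, and the helper names (`spark_conf_for_stack`, `as_submit_args`) are assumptions made for the example and are not taken from this repository; only the property names are the standard Spark, Hadoop S3A, and Hive ones.

```python
def spark_conf_for_stack(
    s3_endpoint="http://localhost:9000",      # assumed MinIO address
    access_key="minio-access-key",            # placeholder credential
    secret_key="minio-secret-key",            # placeholder credential
    metastore_uri="thrift://localhost:9083",  # assumed Hive metastore address
):
    """Return Spark settings for MinIO (via S3A) and a Hive metastore."""
    use_ssl = s3_endpoint.startswith("https")
    return {
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is usually addressed as host/bucket rather than bucket.host
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
        # Use the Hive metastore as the Spark SQL catalog
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.uris": metastore_uri,
    }


def as_submit_args(conf):
    """Flatten a settings dict into `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

The resulting list could be appended to a `spark-submit` command line, or the dict could be applied one entry at a time when building a Spark session.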

Running

invoke compose.up
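
As a rough idea of what a task like this could do, the sketch below assembles a `docker compose ... up -d` command from a list of compose files and runs it. It is an assumption about the general shape of such a task, not the repository's actual implementation; the function names and any file names passed in are hypothetical.

```python
import subprocess


def compose_up_command(compose_files, detached=True):
    """Build the `docker compose` command line for the given files."""
    cmd = ["docker", "compose"]
    for path in compose_files:
        cmd += ["-f", path]  # one -f flag per compose file
    cmd.append("up")
    if detached:
        cmd.append("-d")     # run the containers in the background
    return cmd


def compose_up(compose_files, dry_run=False):
    """Start the stack; with dry_run=True only return the command."""
    cmd = compose_up_command(compose_files)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if docker exits non-zero
    return cmd
```

Calling `compose_up(["kafka.yml"], dry_run=True)` with a hypothetical file name returns the command without starting any container, which is handy for checking the arguments first.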

Reddit kafka connector

A Kafka connector that calls the Reddit API to read posts and comments from a list of subreddits, then writes them to two Kafka topics. See the documentation of this service for details.
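
The routing described above can be pictured with the following minimal sketch, which maps an item fetched from Reddit to a topic, a key, and a serialized value. The real connector is a Kafka Connect plugin and may work quite differently; the topic names, the choice of the subreddit as the record key, the JSON encoding, and the payload fields are all assumptions made for illustration.

```python
import json

POSTS_TOPIC = "reddit-posts"        # hypothetical topic name
COMMENTS_TOPIC = "reddit-comments"  # hypothetical topic name


def to_record(kind, subreddit, payload):
    """Map a Reddit item to a (topic, key, value) triple."""
    if kind == "post":
        topic = POSTS_TOPIC
    elif kind == "comment":
        topic = COMMENTS_TOPIC
    else:
        raise ValueError(f"unknown item kind: {kind!r}")
    # Keying by subreddit keeps items of one subreddit in the same partition
    key = subreddit.encode("utf-8")
    value = json.dumps(payload, sort_keys=True).encode("utf-8")
    return topic, key, value
```

A producer would then send each triple to the broker; posts and comments end up in separate topics so that downstream applications can consume them independently.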

Stream Apps

A Java project that contains 3 modules:

kafka-producers: a module used to generate fake Kafka messages for testing the different services
kstreams-apps: a module that contains applications based on Kafka Streams
spark-streaming-apps: a module that contains applications based on Spark Structured Streaming

About

A repository containing examples of stream processing applications that use Spark Structured Streaming, Kafka Streams, and other tools such as Apache Hudi.

