hussein-awala / stream-applications

A repository containing examples of stream processing applications built with Spark Structured Streaming, Kafka Streams, and other tools such as Apache Hudi...

Stream processing demo

Requirements

  • Install the Python requirements:
pip install -r requirements.txt
  • Grant direnv permission to load the given .envrc:
direnv allow

Docker compose stack

Description

This project relies on several services. To simplify testing and configuring them, Docker Compose files are provided to run them:

  • MinIO: an S3-compatible service used for storage (as the Hadoop DFS for Spark jobs, and as a sink for some Kafka connectors)
  • Hive: a metastore used by Spark to store and manage the metadata of persistent relational entities (databases, tables, ...). It uses a Postgres database as its backend storage.
  • Kafka stack with the following services:
    • Zookeeper: tracks the status of the nodes in the Kafka cluster and maintains the list of Kafka topics and their metadata
    • broker: a Kafka node that stores the message log and handles client requests (produce, consume, ...)
    • schema registry: a service that stores Avro schemas so they can be used later to deserialize messages
    • control center: a UI used to manage the Kafka cluster
    • create reddit topics: a simple container that initializes the cluster by creating the topics the app needs
    • reddit connector: a Kafka connector running in standalone mode that streams Reddit posts and comments

Running

invoke compose.up

Reddit kafka connector

A Kafka connector that calls the Reddit API to read posts and comments for a list of subreddits, then writes them to two Kafka topics. Here is the documentation of this service.

Stream Apps

A Java project containing three modules:

  • kafka-producers: a module that generates fake Kafka messages to test the different services
  • kstreams-apps: a module containing applications based on Confluent's Kafka Streams
  • spark-streaming-apps: a module containing applications based on Spark Structured Streaming
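
To give a rough idea of what generating fake messages for testing can look like, here is a minimal sketch that uses only the JDK. It is not the code of the kafka-producers module: the class name `FakePostGenerator`, the JSON fields (`id`, `subreddit`, `title`, `score`), and the subreddit names are all hypothetical, and the actual call that sends each message to a Kafka topic is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

/**
 * Generates fake Reddit-like post messages as JSON strings.
 * Every name and field in this class is a hypothetical illustration.
 */
final class FakePostGenerator {

    // Hypothetical subreddit names used only for illustration.
    private static final String[] SUBREDDITS = {"java", "apachekafka", "apachespark"};

    private FakePostGenerator() {
    }

    /** Builds one fake post as a JSON string using the given random source. */
    static String fakePost(int index, Random random) {
        String subreddit = SUBREDDITS[random.nextInt(SUBREDDITS.length)];
        int score = random.nextInt(1000);
        return String.format(
                Locale.ROOT,
                "{\"id\":\"post-%d\",\"subreddit\":\"%s\",\"title\":\"Fake post %d\",\"score\":%d}",
                index, subreddit, index, score);
    }

    /** Builds count fake posts; the same seed always yields the same posts. */
    static List<String> fakePosts(int count, long seed) {
        Random random = new Random(seed);
        List<String> posts = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            posts.add(fakePost(i, random));
        }
        return posts;
    }

    public static void main(String[] args) {
        // A real producer would send each string to a Kafka topic instead of printing it.
        for (String post : fakePosts(3, 42L)) {
            System.out.println(post);
        }
    }
}
```

Seeding the generator keeps test runs repeatable, which helps when comparing the output of the streaming applications across runs.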
