san089 / SF-Crime-Statistics

A Kafka and Spark Streaming Integration project: SF Crime Statistics with Spark Streaming


Kafka and Spark Streaming Integration

Overview

In this project, we provide a statistical analysis of the data using Apache Spark Structured Streaming. We create a Kafka server to produce data, a test Kafka consumer to consume it, and ingest the data through Spark Structured Streaming. We then apply Spark Streaming windowing and filtering to aggregate the data and extract counts on an hourly basis.
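As a rough illustration of that hourly aggregation, the sketch below applies the same window() grouping to a static DataFrame; the column names (call_datetime, original_crime_type_name) and the sample rows are hypothetical, not taken from the repo.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.master("local[*]").appName("window-demo").getOrCreate()

# Hypothetical sample rows standing in for the SF crime call records
rows = [
    ("2018-12-31 22:15:00", "Traffic Stop"),
    ("2018-12-31 22:45:00", "Traffic Stop"),
    ("2018-12-31 23:10:00", "Fight No Weapon"),
]
df = spark.createDataFrame(rows, ["call_datetime", "original_crime_type_name"])

# Filter, then count per crime type inside hourly tumbling windows
hourly_counts = (
    df.withColumn("call_datetime", col("call_datetime").cast("timestamp"))
      .filter(col("original_crime_type_name").isNotNull())
      .groupBy(window(col("call_datetime"), "60 minutes"),
               col("original_crime_type_name"))
      .count()
)
hourly_counts.show(truncate=False)

The same groupBy/window call works unchanged on a streaming DataFrame, which is how the streaming job extracts counts per hour.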

Environment

  • Java 1.8.x
  • Python 3.6 or above
  • Zookeeper
  • Kafka
  • Scala 2.11.x
  • Spark 2.4.x

How to Run?

Start Zookeeper and Kafka Server

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
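Once both are running, you can confirm the broker is reachable by listing its topics (the --zookeeper flag applies to Kafka releases of this era; newer releases use --bootstrap-server instead):

bin/kafka-topics.sh --list --zookeeper localhost:2181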

Run the Kafka Producer server

python kafka_server.py
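kafka_server.py produces the records onto a Kafka topic. A minimal sketch of such a producer using kafka-python is shown below; the topic name, input file, and record shape are assumptions, not necessarily what the repo uses.

import json
import time

from kafka import KafkaProducer  # pip install kafka-python

TOPIC = "sf.crime.calls"                     # hypothetical topic name
INPUT_FILE = "police-department-calls.json"  # hypothetical input file

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open(INPUT_FILE) as f:
    records = json.load(f)

for record in records:
    producer.send(TOPIC, value=record)
    time.sleep(0.5)  # throttle to simulate a live feed

producer.flush()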

Run the Kafka Consumer server

python kafka_consumer.py
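kafka_consumer.py is a test consumer used to confirm that the producer is actually writing to the topic. A minimal kafka-python sketch, with the topic name assumed to match the producer above:

import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "sf.crime.calls",                        # hypothetical topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Echo each record so it can be inspected in the console
for message in consumer:
    print(message.value)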

Submit Spark Streaming Job

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.4 --master local[*] data_stream.py
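The --packages coordinate pulls in the Kafka source for Structured Streaming; its Scala suffix (2.11) must match the Scala build of your Spark installation, and ideally the connector version should match the Spark version as well. Inside data_stream.py, the job subscribes to the Kafka topic and decodes the binary payload before aggregating; a minimal sketch, where the topic name and the option values are assumptions:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("SFCrimeStatistics")
         .getOrCreate())

# Subscribe to the Kafka topic; topic name and offsets are assumptions
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sf.crime.calls")
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 200)
      .load())

# Kafka delivers key/value as binary; cast the value to a string before parsing
raw = df.selectExpr("CAST(value AS STRING) AS value")

# Writing to the console produces the progress reporter output shown below
query = raw.writeStream.outputMode("append").format("console").start()
query.awaitTermination()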

Kafka Consumer console output

[Screenshot: Consumer console output]

Streaming progress reporter

[Screenshot: Progress Reporter]

Output

[Screenshot: output]
