nikb-de / SimpleSparkMlPipeline



About the project

The project uses credit card customer data to predict which customers will churn. It builds a Spark ML model and then uses that model inside a Spark Streaming application, which reads data from one Kafka topic and writes the results to another.

The project consists of 4 Docker images:

  • Kafka
  • Spark-master
  • Spark-worker-1
  • Zookeeper

To start all images, run:

docker-compose up


The JAR should be assembled in StreamingProject.


This image contains 2 Spark applications:

Model builder

To build a model, run the following inside the Docker container:


It runs spark-submit for the github.bakanchevn.MLModelGeneration class.

It builds the model and saves it to a file, which can be reused later.

Streaming process starter

To start the streaming pipeline, which scores incoming client information, run this command:


It starts an application that reads the client_in topic from Kafka as a stream, evaluates the model on each row, and writes the results to the client_out topic.


All commands mentioned below can be found in the folder


To emulate a flow of incoming clients, you can use one of these commands:

Send a single batch of clients:

awk 'NR>1 && NR<200' BankChurners.csv | kafka-console-producer --topic client_in --bootstrap-server localhost:9092

To send rows one at a time with a delay between them, use:

awk -F ',' 'NR>1 {print}' < BankChurners.csv | xargs -I % sh  -c '{ echo %; sleep 1; }' |  kafka-console-producer --topic client_in --bootstrap-server localhost:9092

If xargs is not available inside the Docker container, you can use a bash script instead.
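The script shipped with the repo is not reproduced here. As a minimal sketch of the same idea, assuming the CSV has a single header line, a plain read loop can replace xargs. The function name `emit_rows` is hypothetical and used only for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the script provided in the repo.
# Prints every data row of a CSV file (the first line is assumed to be
# a header and is skipped), pausing between rows.
#   $1 - path to the CSV file
#   $2 - delay in seconds between rows (defaults to 1)
emit_rows() {
  # tail -n +2 starts output at the second line, dropping the header.
  tail -n +2 "$1" | while IFS= read -r row || [ -n "$row" ]; do
    printf '%s\n' "$row"
    sleep "${2:-1}"
  done
}
```

It could then be piped into the producer in the same way as the xargs variant: `emit_rows BankChurners.csv 1 | kafka-console-producer --topic client_in --bootstrap-server localhost:9092`.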

You can start two or more kafka-console-producer instances via the provided scripts to see what happens when there are several input channels.
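How the provided scripts split the data is not shown here. One possible sketch, assuming the rows are simply alternated between two background pipelines, is given below. The function name `run_two_channels` is hypothetical, and the sink command is passed as arguments so that the same function can be tried with `cat` before pointing it at Kafka:

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not one of the provided scripts.
# Feeds the data rows of a CSV file through two concurrent sinks:
# one sink receives the even-numbered lines, the other the odd-numbered
# lines. The header (line 1) is skipped.
#   $1      - path to the CSV file
#   $2...   - sink command and its arguments
run_two_channels() {
  src="$1"
  shift
  awk 'NR>1 && NR%2==0' "$src" | "$@" &
  awk 'NR>1 && NR%2==1' "$src" | "$@" &
  # Block until both background pipelines have finished.
  wait
}
```

For example: `run_two_channels BankChurners.csv kafka-console-producer --topic client_in --bootstrap-server localhost:9092`. Because the two pipelines run concurrently, the order in which rows reach the topic is not guaranteed.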


To see the results of model evaluation, you can use kafka-console-consumer. Example:

kafka-console-consumer --topic client_out --bootstrap-server localhost:9092 --property print.key=true --property key.separator="-" --from-beginning


  • Set up a proper Spark standalone cluster with a master instead of a local run
  • Fix the Docker setup:
    • Build with sbt (or inside Docker) instead of obtaining a JAR before the build
  • Add an application that produces data to the client_in topic and consumes data from the client_out topic, each in its own thread
  • Change the data model to guarantee uniqueness. Right now, if the same client is sent from both producers, it is hard to tell which result belongs to which input.


