saLeox / Lambda-Architecture-BigData-Pipeline-Curation

Curation of a big data pipeline that applies the lambda architecture, combining batch and stream processing. Implemented with Spring Boot and Spark.


Architecture Overview

1. Ingest Layer

  • Set up the Kafka cluster and its monitoring tools with Docker Compose.
  • Collect the house transaction data as events and send them to a Kafka topic.
  • Kafka producer & consumer.
  • Streaming ETL that sinks to BigQuery.
  • Materialize a KTable of counts based on a tumbling time window.
  • Train the regression model with the machine learning API provided by Spark MLlib and persist it to GCS.
  • Use metrics to evaluate the performance of each model.
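
To make the tumbling-window count from the list above concrete, here is a minimal plain-Java sketch of what such a count computes. It is not the Kafka Streams API and not this repository's code: the class name `TumblingWindowCount`, its methods, and the sample timestamps are hypothetical. It assumes event timestamps in epoch milliseconds and windows aligned to the epoch, so each event falls into exactly one window.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; all names here are hypothetical.
class TumblingWindowCount {

    // Start of the tumbling window that contains the given timestamp.
    // Windows are aligned to the epoch and never overlap.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    // Counts events per window; each key is a window start in milliseconds.
    static Map<Long, Long> countByWindow(long[] timestampsMs, long windowSizeMs) {
        if (windowSizeMs <= 0) {
            throw new IllegalArgumentException("windowSizeMs must be positive");
        }
        Map<Long, Long> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            counts.merge(windowStart(ts, windowSizeMs), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical event timestamps, counted in one-minute windows.
        long[] eventTimes = {1_000L, 59_999L, 60_000L, 125_000L};
        System.out.println(countByWindow(eventTimes, 60_000L));
        // prints {0=2, 60000=1, 120000=1}
    }
}
```

In a streaming job the same grouping is maintained incrementally as events arrive, and the resulting table is keyed by both the record key and the window rather than by the window start alone.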

4. Serving Layer

Limitations and Future Enhancements

  • Integrate a neural network to solve the regression problem using deeplearning4j.
  • Use Flink & Alink to unify online and offline machine learning, since Spark MLlib only provides a streaming linear regression model for regression problems.
  • Embed pre-processing and feature engineering into the pipeline and automate them.

