saLeox / Lambda-Architecture-BigData-Pipeline-Curation

Curation of a big data pipeline applying the lambda architecture, combining batch and streaming processing. Implemented with Spring Boot and Spark.



Architecture Overview

1. Ingest Layer

  • Set up the Kafka cluster with monitoring tools via Docker Compose.
  • Collect the house transaction data as events and send them to a Kafka topic.
  • Kafka producer & consumer (see the sketch after this list).
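
A minimal Spring Boot sketch of the producer and consumer side, assuming a hypothetical house-transactions topic and events already serialized as JSON strings (the topic name, group id, and class name are illustrative, not taken from this repository):

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class HouseTransactionStreamService {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public HouseTransactionStreamService(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Publish one house transaction event (already serialized to JSON) to the ingest topic
    public void send(String transactionJson) {
        kafkaTemplate.send("house-transactions", transactionJson);
    }

    // Consume events from the same topic; the streaming ETL and model training read from here
    @KafkaListener(topics = "house-transactions", groupId = "ingest-layer")
    public void consume(String transactionJson) {
        System.out.println("Received house transaction: " + transactionJson);
    }
}
```
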
2. Speed Layer

  • Streaming ETL that sinks to BigQuery.
  • Materialize the KTable of counts based on a tumbling time window (see the sketch after this list).
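
A minimal Kafka Streams sketch of the windowed count, assuming string-keyed events on a hypothetical house-transactions topic, a five-minute tumbling window, and Kafka Streams 3.x (topic names, store name, and window size are illustrative):

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class TransactionCountTopology {

    public static void build(StreamsBuilder builder) {
        // Read raw house transaction events, keyed by e.g. city or district
        KStream<String, String> transactions = builder.stream(
                "house-transactions", Consumed.with(Serdes.String(), Serdes.String()));

        // Count events per key inside a 5-minute tumbling window and
        // materialize the result as a queryable state store
        KTable<Windowed<String>, Long> counts = transactions
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                .count(Materialized.as("transaction-counts-store"));

        // Flatten the windowed key and forward the counts to an output topic
        counts.toStream()
                .map((windowedKey, count) -> KeyValue.pair(
                        windowedKey.key() + "@" + windowedKey.window().startTime(), count))
                .to("house-transaction-counts", Produced.with(Serdes.String(), Serdes.Long()));
    }
}
```
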
3. Batch Layer

  • Train the regression model with the machine learning API provided by Spark MLlib and persist it into GCS.
  • Use metrics to evaluate the performance of each model (see the sketch after this list).
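
A minimal Spark MLlib sketch of training, persisting, and evaluating the regression model, assuming a hypothetical Parquet dataset of house transactions on GCS with a price label and a few numeric feature columns (the bucket paths and column names are illustrative):

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HousePriceRegressionJob {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("house-price-regression")
                .getOrCreate();

        // Load curated house transaction data (illustrative GCS path)
        Dataset<Row> data = spark.read().parquet("gs://example-bucket/house-transactions/");
        Dataset<Row>[] splits = data.randomSplit(new double[]{0.8, 0.2}, 42L);

        // Assemble numeric columns into a single feature vector
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"area", "rooms", "age"})
                .setOutputCol("features");

        LinearRegression lr = new LinearRegression()
                .setLabelCol("price")
                .setFeaturesCol("features");

        Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{assembler, lr});
        PipelineModel model = pipeline.fit(splits[0]);

        // Persist the fitted pipeline to GCS so the serving layer can load it later
        model.write().overwrite().save("gs://example-bucket/models/house-price-lr");

        // Evaluate on the held-out split with RMSE
        RegressionEvaluator evaluator = new RegressionEvaluator()
                .setLabelCol("price")
                .setPredictionCol("prediction")
                .setMetricName("rmse");
        double rmse = evaluator.evaluate(model.transform(splits[1]));
        System.out.println("RMSE on test split: " + rmse);

        spark.stop();
    }
}
```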

4. Serving Layer

Limitation and Future Enhancement

  • Integrate a neural network to solve the regression problem using deeplearning4j.
  • Use Flink & Alink to achieve unified online & offline machine learning, since Spark MLlib only provides a streaming linear regression model for the regression problem.
  • Embed the pre-processing and feature engineering into the pipeline and make them automatic.
