runalddsouza / hudi-kafka

Data ingestion using Hudi DeltaStreamer and Kafka

hudi-kafka

This project has two components:

  • Kafka AvroProducer: produces cryptocurrency data and publishes it to a Kafka topic.
  • Hudi DeltaStreamer: ingests the data from Kafka and writes it to Hudi tables.

Producer

  • Install packages: pip install -r requirements.txt
  • Start Producer: python producer/producer.py --topic <topic-name> --bootstrap-servers <broker-server> --schema-registry <schema-registry-url> --log-file <log-file-path>
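
The command-line flags above could be parsed and mapped onto a Kafka client configuration roughly as follows. This is a minimal sketch, not the contents of producer/producer.py: the function names `parse_args` and `build_producer_config`, the sample values, and treating `--log-file` as optional are assumptions. The keys `bootstrap.servers` and `schema.registry.url` are the ones the confluent-kafka `AvroProducer` accepts, assuming that is the client in use.

```python
import argparse


def parse_args(argv=None):
    """Parse the four flags shown in the start command (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Produce Avro records to Kafka")
    parser.add_argument("--topic", required=True, help="Kafka topic name")
    parser.add_argument("--bootstrap-servers", required=True, help="Broker host:port")
    parser.add_argument("--schema-registry", required=True, help="Schema Registry URL")
    # Assumption: the log file is optional in this sketch.
    parser.add_argument("--log-file", default=None, help="Path of the log file")
    return parser.parse_args(argv)


def build_producer_config(args):
    """Map the parsed flags onto producer config keys (hypothetical helper)."""
    return {
        "bootstrap.servers": args.bootstrap_servers,
        "schema.registry.url": args.schema_registry,
    }


# Sample values only; replace them with your own broker and registry addresses.
args = parse_args([
    "--topic", "crypto",
    "--bootstrap-servers", "localhost:9092",
    "--schema-registry", "http://localhost:8081",
    "--log-file", "producer.log",
])
config = build_producer_config(args)
print(args.topic, config)
```

Keeping the configuration in a plain dictionary makes it easy to inspect or log before handing it to the producer client.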

Consumer

Refer to the documentation for configuration details.

  • Install Spark
  • Update the Hudi config and Kafka topic settings in kafka-source.properties
  • Download the Hudi utilities bundle and set its path in hudi-delta-streamer.sh
  • Start: delta-streamer/hudi-delta-streamer.sh <spark-master> <broker-server> <schema-registry-url> delta-streamer/kafka-source.properties <output-path>
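
For orientation, the sketch below assembles the kind of `spark-submit` invocation that a DeltaStreamer wrapper script typically issues. It is not the contents of hudi-delta-streamer.sh: the function name, the bundle jar path, the table name `crypto`, the ordering field `timestamp`, and the COPY_ON_WRITE table type are all assumptions chosen for illustration. The class names and options are those of the Hudi utilities bundle; newer Hudi releases also offer the same tool under the name HoodieStreamer.

```python
import shlex


def build_delta_streamer_command(spark_master, broker, registry_url, props_file,
                                 output_path,
                                 bundle_jar="hudi-utilities-bundle.jar",  # hypothetical path
                                 table_name="crypto",                     # hypothetical name
                                 ordering_field="timestamp"):             # hypothetical field
    """Return a spark-submit argument list for Hudi DeltaStreamer (illustrative)."""
    return [
        "spark-submit",
        "--master", spark_master,
        "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        bundle_jar,
        "--table-type", "COPY_ON_WRITE",
        # Read Avro records from Kafka and resolve schemas via Schema Registry.
        "--source-class", "org.apache.hudi.utilities.sources.AvroKafkaSource",
        "--schemaprovider-class", "org.apache.hudi.utilities.schema.SchemaRegistryProvider",
        "--source-ordering-field", ordering_field,
        "--target-base-path", output_path,
        "--target-table", table_name,
        "--props", props_file,
        # Connection settings passed on the command line instead of the properties file.
        "--hoodie-conf", f"bootstrap.servers={broker}",
        "--hoodie-conf", f"hoodie.deltastreamer.schemaprovider.registry.url={registry_url}",
    ]


# Sample values only, mirroring the positional arguments of the start command.
cmd = build_delta_streamer_command(
    "local[2]",
    "localhost:9092",
    "http://localhost:8081/subjects/crypto-value/versions/latest",
    "delta-streamer/kafka-source.properties",
    "/tmp/hudi/crypto",
)
print(shlex.join(cmd))
```

The Kafka topic itself would still be set in kafka-source.properties, for example through `hoodie.deltastreamer.source.kafka.topic`.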

Docker Setup

  • Kafka
  • Schema Registry
  • Zookeeper
  • Producer
  • Consumer (Hudi DeltaStreamer)

Steps:

  • Clone repository
  • Run: cd docker
  • Start services: docker-compose up

Languages

  • Python: 76.6%
  • Shell: 14.8%
  • Dockerfile: 8.6%