
kafka-delta-ingest

The kafka-delta-ingest project aims to build a highly efficient daemon for streaming data through Apache Kafka into Delta Lake.

This project is currently highly experimental and evolving in tandem with the delta-rs bindings.

Developing

  • Compile: cargo build

  • Launch Kafka: docker-compose up

  • Run kafka-delta-ingest (with a short 10s allowed_latency):

RUST_LOG=debug cargo run ingest example ./tests/data/example --allowed_latency 10 -t 'modified_date: substr(modified,`0`,`10`)' 'kafka_offset: kafka.offset'
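The two -t transforms append derived fields to each message before it is written to the Delta table. For the first test message produced below, the stored record should look roughly like this (the kafka_offset value depends on the actual partition offset at produce time):

{
  "id": "1",
  "value": 1,
  "modified": "2021-03-16T14:38:58Z",
  "modified_date": "2021-03-16",
  "kafka_offset": 0
}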
  • In a separate shell, produce ten messages to the example topic, e.g.:

for i in $(seq 1 10); do
  echo "{\"id\":\"$i\",\"value\":$i,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1
done
  • Watch the Delta table folder for new files:

    • watch ls tests/data/example

    • cat tests/data/example/_delta_log/00000000000000000001.json
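Each commit file under _delta_log contains line-delimited JSON actions per the Delta transaction protocol; the add action recording a newly written data file looks roughly like this (abbreviated, values illustrative):

{"add":{"path":"part-00000-....snappy.parquet","partitionValues":{},"size":1234,"modificationTime":1616509138000,"dataChange":true}}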

  • Check committed offsets

$KAFKA_HOME/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group kafka-delta-ingest:example
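kafka-consumer-groups.sh prints one row per partition; after producing the ten messages above, expect something like the following (values illustrative, trailing consumer-id columns omitted):

GROUP                       TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
kafka-delta-ingest:example  example  0          10              10              0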

Tests

  • For unit tests, run cargo test.

  • Integration tests that depend on Kafka are currently marked with #[ignore]. To run them locally, start a local Kafka with docker-compose up, then run RUST_LOG=debug cargo test -- --ignored --nocapture.
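The full local sequence, assuming the bundled docker-compose file provides Kafka:

docker-compose up -d
cargo test
RUST_LOG=debug cargo test -- --ignored --nocapture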

Test Data

A tarball containing 100K line-delimited JSON messages is included at tests/json/web_requests-100K.json.tar.gz. This file will be played onto Kafka in future integration and load tests, and should be useful for exploratory testing as well.
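One way to play it onto a local topic for exploratory testing (assuming the archive extracts to web_requests-100K.json; the web_requests topic name is a hypothetical choice):

tar -xzf tests/json/web_requests-100K.json.tar.gz
kafkacat -P -b localhost:9092 -t web_requests -l web_requests-100K.json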

Pretty-printed example from the file

{
  "meta": {
    "producer": {
      "timestamp": "2021-03-24T15:06:17.321710+00:00"
    }
  },
  "method": "DELETE",
  "session_id": "7c28bcf9-be26-4d0b-931a-3374ab4bb458",
  "status": 204,
  "url": "http://www.youku.com",
  "uuid": "831c6afa-375c-4988-b248-096f9ed101f8"
}

Test data generation spec

  • meta.producer.timestamp - ISO 8601 timestamp recorded when the test message was generated

  • method - Random choice from a list of HTTP methods

  • session_id - Random choice from a list of 200 pre-generated UUIDs

  • status - Random choice from a list of HTTP status codes

  • url - Random choice from a list of the top 100 most popular websites per Wikipedia

  • uuid - A new UUID, unique per message

URLs sampled for the test data are sourced from Wikipedia’s list of most popular websites - https://en.wikipedia.org/wiki/List_of_most_popular_websites.
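A minimal bash sketch of this generation spec (hypothetical: the actual generator is not included here; requires GNU date and uuidgen):

# Sample pools per the spec above; URL list truncated from the 100 in the real data.
methods=(GET POST PUT DELETE HEAD)
statuses=(200 201 204 301 400 404 500)
urls=(http://www.google.com http://www.youku.com http://www.wikipedia.org)
# The spec calls for 200 pre-generated session ids.
session_ids=(); for _ in $(seq 1 200); do session_ids+=("$(uuidgen)"); done
for _ in $(seq 1 10); do
  printf '{"meta":{"producer":{"timestamp":"%s"}},"method":"%s","session_id":"%s","status":%d,"url":"%s","uuid":"%s"}\n' \
    "$(date -u '+%Y-%m-%dT%H:%M:%S.%6N+00:00')" \
    "${methods[RANDOM % ${#methods[@]}]}" \
    "${session_ids[RANDOM % ${#session_ids[@]}]}" \
    "${statuses[RANDOM % ${#statuses[@]}]}" \
    "${urls[RANDOM % ${#urls[@]}]}" \
    "$(uuidgen)"
done
# Pipe the output to kafkacat to play it onto a topic.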

About

License: Apache License 2.0