The kafka-delta-ingest project aims to build a highly efficient daemon for streaming data through Apache Kafka into Delta Lake.
This project is currently highly experimental and evolving in tandem with the delta-rs bindings.
-
Compile:
cargo build
-
Launch Kafka -
docker-compose up
-
Run kafka-delta-ingest (with a short 10s allowed_latency):
RUST_LOG=debug cargo run ingest example ./tests/data/example --allowed_latency 10 -t 'modified_date: substr(modified,`0`,`10`)' 'kafka_offset: kafka.offset'
-
In separate shell, produce messages to
example
topic, e.g.:
echo "{\"id\":\"1\",\"value\":1,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"2\",\"value\":2,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"3\",\"value\":3,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"4\",\"value\":4,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"5\",\"value\":5,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"6\",\"value\":6,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"7\",\"value\":7,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"8\",\"value\":8,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"9\",\"value\":9,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"10\",\"value\":10,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
-
Watch the delta table folder for new files
-
watch ls tests/data/example
-
cat tests/data/example/_delta_logs/00000000000000000001.json
-
-
Check committed offsets
$KAFKA_HOME/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group kafka-delta-ingest:example
-
For unit tests, run
cargo test
. -
Integration tests that depend on Kafka are currently marked with
#[ignore]
. RunRUST_LOG=debug cargo test — --ignored --nocapture
to run these locally after runningdocker-compose up
to create a local Kafka.
A tarball containing 100K line-delimited JSON messages is included in tests/json/web_requests-100K.json.tar.gz
. This file will be used to play onto Kafka in future integration and load tests and should be useful for exploratory testing as well.
{
"meta": {
"producer": {
"timestamp": "2021-03-24T15:06:17.321710+00:00"
}
},
"method": "DELETE",
"session_id": "7c28bcf9-be26-4d0b-931a-3374ab4bb458",
"status": 204,
"url": "http://www.youku.com",
"uuid": "831c6afa-375c-4988-b248-096f9ed101f8"
}
-
meta.producer.timestamp
- ISO8601 timestamp when test message was generated -
method
- Random choice from list of HTTP methods -
session_id
- Random choice from list of 200 generated UUIDs -
status
- Random choice from list of HTTP status codes -
url
- Random choice from list of top 100 popular websites from wikipedia -
uuid
- New uuid - unique per message
URLs sampled for the test data are sourced from Wikipedia’s list of most popular websites - https://en.wikipedia.org/wiki/List_of_most_popular_websites.