turbolytics / sql-flow

DuckDB for streaming data

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

SQLFlow: DuckDB for streaming data.

SQLFlow enables SQL-based stream-processing, powered by DuckDB. SQLFlow enables writing kafka stream processing logic in pure sql.

SQLFlow supports:

  • Kafka streaming - Writing a consumer that performs SQL based transformations and publishing the output to another kafka topic.
  • JSON on the wire
  • Writing stream transformations in pure SQL, powered by DuckDB
  • Performant librdkafka python consumer

SQLFlow is currently not a good fit for:

  • Stateful stream processing
  • Wire protocols other than JSON

SQLFlow is a kafka consumer that embeds SQL for stream transformation:

Screenshot 2023-11-26 at 8 16 47 PM

Getting Started

Docker

Docker is the easiest way to get started.

  • Pull the sql-flow docker image
docker pull turbolytics/sql-flow:latest
  • Validate config by invoking it on test data
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.yml /tmp/conf/fixtures/simple.json

['{"city":"New York","city_count":28672}', '{"city":"Baltimore","city_count":28672}']
  • Start kafka locally using docker
docker-compose -f dev/kafka-single.yml up -d
  • Publish test messages to kafka
python3 cmd/publish-test-data.py --num-messages=10000 --topic="topic-local-docker"
  • Start kafka consumer from inside docker-compose container
docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-local-docker
  • Start SQLFlow in docker
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest run /tmp/conf/config/local.docker.yml
  • Verify output in the kafka consumer
...
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}

The dev invoke command enables testing a sql-flow pipeline configuration on a batch of test data. This enables fast feedback local development before launching a sql-flow consumer that reads from kafka.

Configuration

The heart of sql-flow is the pipeline configuration file. Each configuration file specifies:

  • Kafka configuration
  • Pipeline configuration
    • Input configuration
    • SQL transformation
    • Output configuration
Screenshot 2023-11-26 at 8 10 44 PM

Every instance of sql-flow needs a pipeline configuration file.

Recipes

Coming Soon, until then checkout:

Running multiple SQLFlow instances on the same filesystem

Verifying a configuration locally

Development

  • Install python deps
pip install -r requirements.txt
pip install -r requirements.dev.txt

C_INCLUDE_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/include LIBRARY_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/lib pip install confluent-kafka
  • Run tests
pytests tests

Benchmarks

Methodology

Each test loads 1MM records into kafka. Each test executes sql-flow consumer until each message is processed. Each test captures the maximum resident memory during the benchmark, and the average throughput of message ingestion.

System

Hardware:
    Hardware Overview:
      Model Name: MacBook Pro
      Model Identifier: MacBookPro18,3
      Model Number: Z15G001X2LL/A
      Chip: Apple M1 Pro
      Total Number of Cores: 10 (8 performance and 2 efficiency)
      Memory: 32 GB
      Activation Lock Status: Enabled
Name Throughput Max RSS Memory Peak Memory Usage
Simple Aggregation Memory 45,000 msgs / sec 230 MiB 130 MiB
Simple Aggregation Disk 36,000 msgs / sec 256 MiB 102 MiB
Enrichment 13,000 msgs /sec 368 MiB 124 MiB
CSV Disk Join 11,500 msgs /sec 312 MiB 152 MiB
CSV Memory Join 33,200 msgs / sec 300 MiB 107 MiB

Simple Aggregate - Disk / Mem

Performs a simple aggregate. Output is significantly smaller than input.

./benchmark/simple-agg-disk.sh
./benchmark/simple-agg-mem.sh

Enriches

Performs an enrichment. Output is 1:1 records with input, but each output record is enhanced with additional information.

./benchmark/enrichment.sh

CSV Disk Join

./benchmark/csv-disk-join.sh

CSV Memory Join

./benchmark/csv-disk-join.sh

Like SQLFlow? Use SQLFlow? Feature Requests? Please let us know! danny@turbolytics.io

About

DuckDB for streaming data

License:MIT License


Languages

Language:Python 89.4%Language:Shell 9.1%Language:Dockerfile 1.4%Language:Makefile 0.1%