aliavni / docker


Docker data stack

Run

  1. Install Docker Desktop
  2. Create a .env file in the repo root by copying .env.template
  3. Fill in the desired POSTGRES_PASSWORD value in the .env file
  4. Build and start the containers:
docker compose up -d --build
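Once the stack is up, a quick sanity check is to list the containers and tail a service's logs (standard Docker Compose subcommands; the jupyterlab service name is the one referenced in the Jupyter section below):

```shell
# List the stack's containers and their status
docker compose ps

# Follow logs for a single service, e.g. the Jupyter container
docker compose logs -f jupyterlab
```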

Jupyter

Check the jupyterlab container logs and open the link that looks like http://127.0.0.1:8089/lab?token=...

Trino

docker exec -it trino trino
SHOW SCHEMAS FROM db;
USE db.public;
SHOW TABLES FROM public;
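From the same session you can inspect and query any table that SHOW TABLES returned; the table name below is a placeholder, not one defined by this repo:

```sql
-- Replace my_table with a table name returned by SHOW TABLES
DESCRIBE db.public.my_table;
SELECT * FROM db.public.my_table LIMIT 10;
```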

Spark

docker exec -it spark-master /bin/bash
cd /opt/spark/bin
./spark-submit --master spark://0.0.0.0:7077 \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi  \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar 100
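For interactive work, the same container should also have the Scala shell that ships with every Spark build, pointed at the same master URL as the spark-submit example above:

```shell
# Interactive Scala shell against the cluster master
docker exec -it spark-master /opt/spark/bin/spark-shell --master spark://0.0.0.0:7077
```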

Thrift

docker exec -it spark-master /bin/bash
./bin/beeline
!connect jdbc:hive2://localhost:10000 scott tiger
show databases;
create table hive_example(a string, b int) partitioned by(c int);
alter table hive_example add partition(c=1);
insert into hive_example partition(c=1) values ('a', 1), ('a', 2), ('b', 3);
select count(distinct a) from hive_example;
select sum(b) from hive_example;

ScyllaDB

Connect to cqlsh

docker exec -it scylla-1 cqlsh

Create keyspace

CREATE KEYSPACE data
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};

Use keyspace and create table

USE data;

CREATE TABLE data.users (
    user_id uuid PRIMARY KEY,
    first_name text,
    last_name text,
    age int
);

Insert data

INSERT INTO data.users (user_id, first_name, last_name, age)
  VALUES (123e4567-e89b-12d3-a456-426655440000, 'Polly', 'Partition', 77);
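To verify the insert, read the row back by the same primary key used above:

```sql
SELECT * FROM data.users WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;
```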

Kafka

Create topic

docker exec -it kafka kafka-topics.sh --create --topic test --bootstrap-server 127.0.0.1:9092
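The same script's --describe flag confirms the topic was created and shows its partition layout:

```shell
docker exec -it kafka kafka-topics.sh --describe --topic test --bootstrap-server 127.0.0.1:9092
```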

Kafka producer

See kafka_producer.ipynb

Kafka consumer

See kafka_consumer.ipynb
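If you prefer the CLI over the notebooks, Kafka's bundled console tools can produce and consume on the same topic:

```shell
# Produce: type messages, one per line; Ctrl+C to stop
docker exec -it kafka kafka-console-producer.sh --topic test --bootstrap-server 127.0.0.1:9092

# Consume from the beginning of the topic
docker exec -it kafka kafka-console-consumer.sh --topic test --from-beginning --bootstrap-server 127.0.0.1:9092
```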

Airflow

Check the .env.template file, copy the Airflow-related variables into your .env file, and update their values where necessary.

Slack integration

You need to create a Slack app and set the AIRFLOW_CONN_SLACK_API_DEFAULT environment variable to your Slack API key. If you don't want to use this integration, remove the AIRFLOW_CONN_SLACK_API_DEFAULT variable from your .env file.

Mongo
