This repository includes the artifacts required to compare an RDBMS (PostgreSQL) with Kafka. We primarily look at the cost of adding triggers to achieve CEP (complex event processing) the way a stream processing system does, analyze the impact on insertion speed, and then examine how decoupling ingestion from processing is beneficial when Kafka is used as the storage system and KSQL (Kafka Streams) is used for processing. Thanks to meetup.com for providing a streaming API for RSVPs.
- We use the streaming API from meetup.com for our experiment
- Clone this repository
- Run `save.py` to save data from the API:
```
cd datagen
pip3 install requests
python3 save.py
```
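The actual collector lives in `datagen/save.py`; as a rough illustration only, a minimal script that streams RSVPs and appends them to a file could look like the sketch below. The endpoint URL and the output filename are assumptions, not taken from the repository.
```python
# Hypothetical sketch of a save.py-style collector; the real script may differ.
import requests

STREAM_URL = "https://stream.meetup.com/2/rsvps"  # assumed RSVP streaming endpoint
OUTPUT_FILE = "data.txt"                          # later used as datagen/data.txt

with requests.get(STREAM_URL, stream=True) as response:
    response.raise_for_status()
    with open(OUTPUT_FILE, "a") as out:
        # Each line of the stream is one RSVP event encoded as JSON.
        for line in response.iter_lines():
            if line:  # skip keep-alive blank lines
                out.write(line.decode("utf-8") + "\n")
```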
- Once we have sufficient data, transfer it to a GCS bucket named `$BUCKET_NAME` for use in the benchmark
PS: Ensure that the VM has sufficient cores and memory to run Kafka and the consumers in parallel, e.g. 15 cores and 50 GB of memory.
- Run:
```
sudo apt update && sudo apt upgrade
```
- Copy the data generated in the previous step from the GCS bucket into the VM:
```
gsutil cp -R gs://$BUCKET_NAME .
```
- Install `docker` as described here: Install Docker via Convenience Script
```
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
```
- Install `docker-compose`:
```
sudo curl -L "https://github.com/docker/compose/releases/download/1.25.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```
- Install pip3 to install dependencies:
```
sudo apt install python3-pip
```
- Clone this repository onto the VM:
```
git clone https://github.com/snithish/advanced-databases_ApacheKafka.git
```
- Start the containers:
```
cd advanced-databases_ApacheKafka
docker-compose up -d
```
- (Optional) Set up remote port forwarding to view the Control Center; run this on your workstation. Ensure that SSH has been set up to connect to instances: GCP - Connecting to Instances
```
ssh -L 5000:localhost:9021 [USERNAME]@[EXTERNAL_IP_ADDRESS]
```
- Move the data file to `datagen/data.txt`
- Set up the database:
```
docker-compose exec postgres psql -Ubenchmark -f /datagen/initdb.sql
```
- We need to install `psycopg2` to connect to Postgres:
```
pip3 install psycopg2-binary
```
- Run the benchmark:
```
python3 ingest_postgres.py
```
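For orientation, a stripped-down ingestion benchmark in the spirit of this script might look like the following sketch. The connection parameters, table name, and column are assumptions; the real schema and triggers are defined in `initdb.sql`.
```python
# Hypothetical sketch of a Postgres ingestion benchmark; names are assumed.
import time
import psycopg2

# Connection details depend on how docker-compose exposes the postgres service.
conn = psycopg2.connect(host="localhost", dbname="benchmark",
                        user="benchmark", password="benchmark")
cur = conn.cursor()

start = time.time()
rows = 0
with open("datagen/data.txt") as f:
    for line in f:
        # Assumed single JSON column; any triggers from initdb.sql fire per insert.
        cur.execute("INSERT INTO rsvps (payload) VALUES (%s)", (line.strip(),))
        rows += 1
conn.commit()

elapsed = time.time() - start
print(f"Inserted {rows} rows in {elapsed:.2f}s ({rows / elapsed:.0f} rows/s)")
```
Timing the same load with different triggers enabled in `initdb.sql` is what exposes the per-insert cost of in-database processing.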
- Remove the triggers from `initdb.sql` one at a time and repeat the steps above
- Create a new topic to inject data for the benchmark:
```
docker-compose exec broker kafka-topics --create \
  --zookeeper zookeeper:2181 \
  --replication-factor 1 --partitions 60 \
  --topic meetup
```
- Open multiple tabs or panes (a terminal multiplexer like tmux is suggested) or open multiple SSH connections, and run:
```
docker-compose exec ksqldb-cli ksql http://ksqldb-server:8088
```
- Create a `stream` from the Kafka `topic`. Refer to: `kafka-processing.sql`
- Install `kafka-python` to act as a producer for Kafka:
```
pip3 install kafka-python
```
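Before running the benchmark, it can be useful to confirm that the broker is reachable from the VM and that the `meetup` topic exists. A small check using `kafka-python` could look like the snippet below; the bootstrap address is an assumption and depends on the ports exposed in `docker-compose.yml`.
```python
# Quick connectivity check; the broker address is assumed, not taken from the repo.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print("meetup topic present:", "meetup" in consumer.topics())
consumer.close()
```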
- Execute the other queries in multiple panes / windows
- Run the benchmark:
```
python3 ingest_kafka.py
```
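As with the Postgres script, a minimal producer in the spirit of this benchmark could look roughly like the sketch below; the broker address is an assumption, and the topic name matches the one created earlier.
```python
# Hypothetical sketch of a Kafka ingestion benchmark; broker address is assumed.
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

start = time.time()
count = 0
with open("datagen/data.txt") as f:
    for line in f:
        # Fire-and-forget sends; ksqlDB handles the processing asynchronously.
        producer.send("meetup", value=line.strip().encode("utf-8"))
        count += 1
producer.flush()  # wait for all buffered records to be delivered

elapsed = time.time() - start
print(f"Produced {count} messages in {elapsed:.2f}s ({count / elapsed:.0f} msgs/s)")
```
Because the producer only appends to the topic, insertion throughput stays largely independent of the queries running in ksqlDB, which is the decoupling the benchmark is meant to demonstrate.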
- Multiplex and run multiple producers to observe higher throughput
- Repeat steps 4 - 6 for various query combinations