This repository includes the artifacts required to compare an RDBMS (PostgreSQL) with Kafka. We primarily look at the cost of adding triggers to achieve CEP (complex event processing) the way a stream processing system does, analyze the impact on insertion speed, and then examine how decoupling ingestion from processing is beneficial when Kafka is used as the storage system and KSQL (Kafka Streams) is used for processing. Thanks to meetup.com for providing a streaming API for RSVPs.
- We use the streaming API from meetup.com for our experiment
- Clone this repository
- Run `save.py` to save data from the API:
```
cd datagen
pip3 install requests
python3 save.py
```
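The actual collector lives in `datagen/save.py`; as a rough illustration only, a minimal script that streams RSVPs and appends them to a file could look like the sketch below. The endpoint URL and the output filename are assumptions, not taken from the repository.
```python
# Hypothetical sketch of a save.py-style collector; the real script may differ.
import requests

STREAM_URL = "https://stream.meetup.com/2/rsvps"  # assumed RSVP streaming endpoint
OUTPUT_FILE = "data.txt"                          # later used as datagen/data.txt

with requests.get(STREAM_URL, stream=True) as response:
    response.raise_for_status()
    with open(OUTPUT_FILE, "a") as out:
        # Each line of the stream is one RSVP event encoded as JSON.
        for line in response.iter_lines():
            if line:  # skip keep-alive blank lines
                out.write(line.decode("utf-8") + "\n")
```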
- Once we have sufficient data, transfer it to a GCS bucket named `$BUCKET_NAME` for use in the benchmark
PS: Ensure that the VM has sufficient cores and memory to run Kafka and the consumers in parallel, e.g. 15 cores and 50 GB of memory.
- Run:
```
sudo apt update && sudo apt upgrade
```
- Copy the data generated in the previous step from the GCS bucket into the VM:
```
gsutil cp -R gs://$BUCKET_NAME .
```
- Install `docker` as described here: Install Docker via Convenience Script
```
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
```
- Install `docker-compose`:
```
sudo curl -L "https://github.com/docker/compose/releases/download/1.25.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```
- Install pip3 to install dependencies:
```
sudo apt install python3-pip
```
- Clone this repository onto the VM:
```
git clone https://github.com/snithish/advanced-databases_ApacheKafka.git
```
- Start the containers:
```
cd advanced-databases_ApacheKafka
docker-compose up -d
```
- (Optional) Set up remote port forwarding to view the Control Center; run this on your workstation. Ensure that SSH has been set up to connect to instances: GCP - Connecting to Instances
```
ssh -L 5000:localhost:9021 [USERNAME]@[EXTERNAL_IP_ADDRESS]
```
- Move the data file to `datagen/data.txt`
- Set up the database:
```
docker-compose exec postgres psql -Ubenchmark -f /datagen/initdb.sql
```
- We need to install `psycopg2` to connect to Postgres:
```
pip3 install psycopg2-binary
```
- Run the benchmark:
```
python3 ingest_postgres.py
```
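For orientation, a stripped-down ingestion benchmark in the spirit of this script might look like the following sketch. The connection parameters, table name, and column are assumptions; the real schema and triggers are defined in `initdb.sql`.
```python
# Hypothetical sketch of a Postgres ingestion benchmark; names are assumed.
import time
import psycopg2

# Connection details depend on how docker-compose exposes the postgres service.
conn = psycopg2.connect(host="localhost", dbname="benchmark",
                        user="benchmark", password="benchmark")
cur = conn.cursor()

start = time.time()
rows = 0
with open("datagen/data.txt") as f:
    for line in f:
        # Assumed single JSON column; any triggers from initdb.sql fire per insert.
        cur.execute("INSERT INTO rsvps (payload) VALUES (%s)", (line.strip(),))
        rows += 1
conn.commit()

elapsed = time.time() - start
print(f"Inserted {rows} rows in {elapsed:.2f}s ({rows / elapsed:.0f} rows/s)")
```
Timing the same load with different triggers enabled in `initdb.sql` is what exposes the per-insert cost of in-database processing.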
- Remove the triggers from `initdb.sql` one at a time and repeat the steps above
- Create a new topic to inject data for the benchmark:
```
docker-compose exec broker kafka-topics --create \
  --zookeeper zookeeper:2181 \
  --replication-factor 1 --partitions 60 \
  --topic meetup
```
- Open multiple tabs or panes (a terminal multiplexer like tmux is suggested) or open multiple SSH connections, and run:
```
docker-compose exec ksqldb-cli ksql http://ksqldb-server:8088
```
- Create a `stream` from the Kafka `topic`. Refer to: `kafka-processing.sql`
- Install `kafka-python` to act as a producer for Kafka:
```
pip3 install kafka-python
```
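Before running the benchmark, it can be useful to confirm that the broker is reachable from the VM and that the `meetup` topic exists. A small check using `kafka-python` could look like the snippet below; the bootstrap address is an assumption and depends on the ports exposed in `docker-compose.yml`.
```python
# Quick connectivity check; the broker address is assumed, not taken from the repo.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print("meetup topic present:", "meetup" in consumer.topics())
consumer.close()
```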
- Execute the other queries in multiple panes / windows
- Run the benchmark:
```
python3 ingest_kafka.py
```
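As with the Postgres script, a minimal producer in the spirit of this benchmark could look roughly like the sketch below; the broker address is an assumption, and the topic name matches the one created earlier.
```python
# Hypothetical sketch of a Kafka ingestion benchmark; broker address is assumed.
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

start = time.time()
count = 0
with open("datagen/data.txt") as f:
    for line in f:
        # Fire-and-forget sends; ksqlDB handles the processing asynchronously.
        producer.send("meetup", value=line.strip().encode("utf-8"))
        count += 1
producer.flush()  # wait for all buffered records to be delivered

elapsed = time.time() - start
print(f"Produced {count} messages in {elapsed:.2f}s ({count / elapsed:.0f} msgs/s)")
```
Because the producer only appends to the topic, insertion throughput stays largely independent of the queries running in ksqlDB, which is the decoupling the benchmark is meant to demonstrate.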
- Multiplex and run multiple producers to observe higher throughput
- Repeat steps 4 - 6 for various query combinations