You have Kafka, Spark, and Docker installed.
Deactivate the base conda environment (assuming that you are using Anaconda and your terminal defaults to the base environment).
conda deactivate
Determined the python version and location
python3 -V
which python3
Updated pip
python3 -m pip install --upgrade pip
Installed ensurepip.
- Use the second command if there are any dependency issues, but first ensure you've installed aptitude.
sudo apt install python3.10-venv
sudo aptitude install python3.10-venv
Created and activated a virtual environment
- Third command is for deactivation
python3 -m venv stream_votes
source stream_votes/bin/activate
deactivate
Downloading the required packages
pip install -r requirements.txt
- In case of a slow or unstable internet connection, raise pip's timeout by running the command below with the last number (in seconds) set to your preference.
pip config set global.timeout 300
- To fix issues with the installation of psycopg2, run either of the following commands and then re-install the packages. Run the second one if the first one doesn't work, so that the dependencies can be downgraded or upgraded.
sudo apt-get install libpq-dev
sudo aptitude install libpq-dev
- If the previous commands don't work, change psycopg2 to psycopg2-binary in the requirements.txt file.
- Just in case you want to remove the environment:
sudo rm -r stream_votes
Ensure there are no running containers before running the command below.
docker compose up -d
Handles how and what data is entered into the database.
Here we have the candidates, voters, and votes tables.
- Note that a voter can only vote once, hence the unique constraint in the votes table.
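The one-vote-per-voter rule can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Postgres, purely so the example runs without a database server. The column names (voter_id, candidate_id, voting_time) are hypothetical and may differ from the project's actual schema.

```python
import sqlite3


def create_votes_table(conn):
    # Hypothetical schema: the UNIQUE constraint on voter_id is what
    # prevents the same voter from being recorded twice.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS votes (
            voter_id TEXT UNIQUE,
            candidate_id TEXT,
            voting_time TEXT
        )
        """
    )


def cast_vote(conn, voter_id, candidate_id, voting_time):
    """Insert a vote; return False if this voter has already voted."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO votes (voter_id, candidate_id, voting_time) "
                "VALUES (?, ?, ?)",
                (voter_id, candidate_id, voting_time),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")  # in-memory stand-in for Postgres
create_votes_table(conn)
first = cast_vote(conn, "voter-1", "candidate-A", "2024-01-01 09:00:00")
second = cast_vote(conn, "voter-1", "candidate-B", "2024-01-01 09:05:00")
print(first, second)
```

With psycopg2 the same idea would apply, except that the placeholders are %s and the duplicate insert raises a psycopg2 integrity error rather than sqlite3.IntegrityError.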
Data for candidates and voters is obtained from the Random User Generator API
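As a rough sketch of how a Random User Generator response could be turned into a voter record: the field paths (login.uuid, name.first, location.city, and so on) follow the API's JSON response, while the output keys and the helper names to_voter and fetch_voter are assumptions for illustration, not the project's actual code.

```python
import json
from urllib.request import urlopen

RANDOM_USER_URL = "https://randomuser.me/api/?results=1"


def to_voter(user):
    """Flatten one entry of the API's "results" list into a voter record."""
    return {
        "voter_id": user["login"]["uuid"],
        "voter_name": f"{user['name']['first']} {user['name']['last']}",
        "gender": user["gender"],
        "city": user["location"]["city"],
        "state": user["location"]["state"],
        "country": user["location"]["country"],
    }


def fetch_voter(url=RANDOM_USER_URL, timeout=10):
    """Fetch one random user over HTTP (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return to_voter(payload["results"][0])


# Offline example with a hand-written sample shaped like the API response.
sample_user = {
    "gender": "female",
    "name": {"title": "Ms", "first": "Jane", "last": "Doe"},
    "location": {"city": "Leeds", "state": "West Yorkshire", "country": "United Kingdom"},
    "login": {"uuid": "00000000-0000-0000-0000-000000000001"},
}
voter = to_voter(sample_user)
print(voter["voter_name"], voter["city"])
```

Keeping the parsing in to_voter separate from the HTTP call makes it possible to check the mapping without network access.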
Check the Kafka topics and their content
kafka-topics --list --bootstrap-server broker:29092
kafka-console-consumer --topic voters_topic --bootstrap-server broker:29092
Connecting to postgres through CLI
pgcli -h localhost \
-p 5433 -u postgres \
-d votingqu
The PySpark version is 3.4.2. Go to this website
Download the Postgres JDBC driver here
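A PySpark job that reads from Kafka and writes to Postgres typically needs the Kafka connector package matching the Spark version, plus the JDBC driver jar. The sketch below shows one way these might be wired into a session. The Scala suffix _2.12, the jar path, and the helper names are assumptions, so adjust them to your own download.

```python
# Kafka connector coordinates for Spark 3.4.2 (assuming a Scala 2.12 build).
KAFKA_CONNECTOR = "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.2"


def spark_conf(jdbc_jar_path):
    """Return Spark settings that load the Kafka connector and JDBC driver."""
    return {
        "spark.jars.packages": KAFKA_CONNECTOR,  # resolved from Maven at startup
        "spark.jars": jdbc_jar_path,             # local path to the downloaded jar
    }


def build_session(app_name, jdbc_jar_path):
    """Create a SparkSession with the settings above (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the sketch loads without pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in spark_conf(jdbc_jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical jar location; replace with wherever the driver was saved.
conf = spark_conf("path/to/postgresql.jar")
print(conf["spark.jars.packages"])
```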
Created checkpoint folders so the streams can keep track of already processed data
mkdir checkpoints checkpoints/checkpoint1 checkpoints/checkpoint2
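To show where a checkpoint folder comes into play, here is a minimal sketch of a streaming query writing to Kafka with a checkpointLocation option. Pairing checkpoint1 with the aggregated_votes_per_candidate topic is an assumption, as are the helper names; the project's own streaming script may be organised differently.

```python
import tempfile
from pathlib import Path


def kafka_sink_options(topic, checkpoint_dir, bootstrap_servers="localhost:9092"):
    """Build writeStream options for a Kafka sink, creating the checkpoint folder."""
    Path(checkpoint_dir).mkdir(parents=True, exist_ok=True)
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        # Spark stores offsets and state here so a restarted query resumes
        # where it stopped instead of reprocessing the stream.
        "checkpointLocation": str(checkpoint_dir),
    }


def start_kafka_sink(df, options):
    """Start the query; df must be a streaming DataFrame with a "value" column."""
    return (
        df.writeStream.format("kafka")
        .options(**options)
        .outputMode("update")
        .start()
    )


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    opts = kafka_sink_options(
        "aggregated_votes_per_candidate",
        Path(base) / "checkpoints" / "checkpoint1",
    )
    created = Path(opts["checkpointLocation"]).is_dir()
print(opts["topic"], created)
```

Each streaming query should get its own checkpoint folder, which is consistent with two folders being created for two output streams.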
Check topic list
kafka-topics.sh \
--list \
--bootstrap-server localhost:9092
Create topic
kafka-topics.sh \
--create \
--bootstrap-server localhost:9092 \
--topic aggregated_votes_per_candidate
kafka-topics.sh \
--create \
--bootstrap-server localhost:9092 \
--topic aggregated_turnout_per_location
Check progress in the console
kafka-console-consumer.sh \
--topic aggregated_votes_per_candidate \
--bootstrap-server localhost:9092
Running the app in Streamlit
streamlit run streamlit-app.py