To set up the required environment for this project, follow these steps:
If you don't already have Docker and Docker Compose installed, download and install them from the official Docker website: Get Docker.
Pull and start the necessary Docker containers for Kafka (with ZooKeeper), Cassandra, and MongoDB with the following command:
- Code:
docker-compose -f docker-compose.yml up -d
- docker-compose.yml:
version: '3'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    container_name: kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: localhost
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
  cassandra:
    image: cassandra
    container_name: cassandra
    ports:
      - "9042:9042"
  mongo:
    image: mongo
    container_name: mongo
    ports:
      - "27017:27017"
- Installing and Configuring Spark
sudo apt install default-jdk
wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar xvf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
nano ~/.bashrc        # add the two export lines below to the end of ~/.bashrc
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
source ~/.bashrc
spark-shell           # verify the installation
- Open a shell inside the Kafka container:
docker exec -it kafka /bin/sh
- Create the user_profiles topic:
kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic user_profiles
- Check that the topic exists:
kafka-topics.sh --list --zookeeper zookeeper:2181
- Run the producer to start publishing user profiles to the topic:
python3 Producer.py
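Producer.py itself is not reproduced in this guide. A minimal sketch of what it might contain, assuming it pulls random profiles from the randomuser.me API (the field list in the Data Processing Register below matches that API) and publishes them as JSON to the user_profiles topic with the kafka-python client:

# Hypothetical sketch of Producer.py, not the project's actual implementation.
# Assumes the kafka-python and requests packages and the randomuser.me API.
import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # randomuser.me returns {"results": [{gender, name, location, email, ...}]}
    profile = requests.get("https://randomuser.me/api/").json()["results"][0]
    producer.send("user_profiles", profile)
    producer.flush()
    time.sleep(1)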
- Submit the Spark streaming consumer (the --packages option pulls in the Kafka source for Spark):
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 Consumer.py
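Consumer.py is not shown in full here. Note that writing to Cassandra with the org.apache.spark.sql.cassandra format used further below also requires the Cassandra connector on the classpath, e.g. adding com.datastax.spark:spark-cassandra-connector_2.12:3.5.0 to --packages. Before the transformations that follow, the stream has to be read from Kafka and parsed into parsedStreamDF; a minimal sketch of that setup, with an illustrative (not authoritative) schema covering only the fields used below:

# Hypothetical sketch of the Kafka-reading part of Consumer.py.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("UserProfilesConsumer").getOrCreate()

# Illustrative schema: only the nested fields used by the transformations below.
schema = StructType([
    StructField("name", StructType([
        StructField("title", StringType()),
        StructField("first", StringType()),
        StructField("last", StringType()),
    ])),
    StructField("dob", StructType([StructField("date", StringType())])),
    StructField("location", StructType([
        StructField("street", StructType([
            StructField("number", IntegerType()),
            StructField("name", StringType()),
        ])),
        StructField("city", StringType()),
        StructField("state", StringType()),
        StructField("country", StringType()),
        StructField("postcode", StringType()),
    ])),
])

rawStreamDF = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")
               .option("subscribe", "user_profiles")
               .load())

# Kafka delivers the payload as bytes; cast to string and parse the JSON.
parsedStreamDF = rawStreamDF.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")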
- Derive full_name, calculated_age, and complete_address from the parsed stream:
from pyspark.sql.functions import col, concat_ws, current_date, to_date, year

# Build a display name from the nested name fields
parsedStreamDF = parsedStreamDF.withColumn(
    "full_name",
    concat_ws(" ", col("name.title"), col("name.first"), col("name.last"))
)

# Approximate age as the difference between the current year and the year of birth
parsedStreamDF = parsedStreamDF.withColumn(
    "calculated_age",
    year(current_date()) - year(to_date(parsedStreamDF["dob.date"]))
)

# Flatten the nested location fields into a single address string
parsedStreamDF = parsedStreamDF.withColumn(
    "complete_address",
    concat_ws(
        ", ",
        col("location.street.number").cast("string"),
        col("location.street.name"),
        col("location.city"),
        col("location.state"),
        col("location.country"),
        col("location.postcode").cast("string")
    )
)
- Connect to Cassandra from the driver:
from cassandra.cluster import Cluster

cluster = Cluster(['localhost'], port=9042)
session = cluster.connect()
- Define the keyspace and table names:
keyspace = "user_profiles"
table_name = "users"
- Create the keyspace if it does not exist and switch to it:
session.execute(f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH REPLICATION = {{'class': 'SimpleStrategy', 'replication_factor': 1}}")
session.execute(f"USE {keyspace}")
- Create the users table:
table_creation_query = f"""
CREATE TABLE IF NOT EXISTS {table_name} (
    full_name TEXT PRIMARY KEY,
    calculated_age INT,
    complete_address TEXT
)
"""
session.execute(table_creation_query)
- Keep only the columns that will be persisted in Cassandra:
cassandraDF = parsedStreamDF.select("full_name", "calculated_age", "complete_address")
- Define a helper that writes each micro-batch of the stream to Cassandra (the checkpoint location belongs on the streaming writer, not on the per-batch write):
def save_to_cassandra(df, keyspace, table_name, checkpoint_location):
    # Each micro-batch is appended to the Cassandra table via the
    # spark-cassandra-connector; progress is tracked in the checkpoint location.
    writer = (df.writeStream
              .option("checkpointLocation", checkpoint_location)
              .foreachBatch(lambda batch_df, batch_id: batch_df.write
                            .format("org.apache.spark.sql.cassandra")
                            .mode("append")
                            .option("keyspace", keyspace)
                            .option("table", table_name)
                            .save()))
    return writer
- Set the checkpoint location, start the streaming query, and block until it terminates:
checkpoint_location = "./checkpoint/data"
writer = save_to_cassandra(cassandraDF, keyspace, table_name, checkpoint_location)
query = writer.start()
query.awaitTermination()
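Once the stream has been running for a while, you can spot-check that rows are landing in Cassandra from a separate Python shell (a quick sanity check, not part of the original pipeline), reusing a Cassandra session as set up above:

# Sanity check (hypothetical extra step): read a few rows back from Cassandra.
rows = session.execute("SELECT full_name, calculated_age, complete_address FROM user_profiles.users LIMIT 5")
for row in rows:
    print(row.full_name, row.calculated_age, row.complete_address)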
Data Processing Register
In accordance with the GDPR, a detailed register documenting all personal data processing activities is maintained as follows:
1. Data Types Stored: The system processes several categories of personal data consumed from the Kafka topic "user_profiles". These include fields such as gender, name, location, email, login, dob (date of birth), registered, phone, and nat (nationality).
2. Processing Purposes: The data undergoes several transformations, including:
- Deriving full_name by concatenating title, first name, and last name.
- Calculating calculated_age from the date of birth.
- Generating complete_address by combining the address-related fields.
3. Security Measures Implemented: The processed data is written to a Cassandra database to ensure durability and reliability. It is stored in the keyspace "user_profiles", in the table "users", which is configured with appropriate access control and data security measures. The primary key full_name uniquely identifies each record.
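The access-control claim above is not backed by explicit configuration in this walkthrough. As one hedged illustration (not the project's actual setup), role-based access to the users table could be granted from the same Python session, assuming authentication and authorization are enabled in cassandra.yaml (authenticator: PasswordAuthenticator, authorizer: CassandraAuthorizer):

# Hypothetical sketch: a dedicated role for the streaming pipeline.
# Requires authentication/authorization to be enabled in cassandra.yaml.
session.execute("CREATE ROLE IF NOT EXISTS pipeline_writer WITH PASSWORD = 'change-me' AND LOGIN = true")
session.execute("GRANT SELECT ON TABLE user_profiles.users TO pipeline_writer")
session.execute("GRANT MODIFY ON TABLE user_profiles.users TO pipeline_writer")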