Development of a Real-Time Data Pipeline for the Analysis of User Profiles

I. Environment

To set up the required environment for this project, follow these steps:

1. Install Docker

If you don't already have Docker installed, download and install it from the official Docker website: Get Docker.

2. Pull Docker Images

Run the following command to pull the images and start the containers for Zookeeper, Kafka, Cassandra, and MongoDB defined in docker-compose.yml:

  • Code:
    docker-compose -f docker-compose.yml up -d
    
  • docker-compose.yml:
      version: '3'
    
      services:
        zookeeper:
          image: wurstmeister/zookeeper
          container_name: zookeeper
          ports:
            - "2181:2181"
      
        kafka:
          image: wurstmeister/kafka
          container_name: kafka
          ports:
            - "9092:9092"
          environment:
            KAFKA_ADVERTISED_HOST_NAME: localhost
            KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      
        cassandra:
          image: cassandra
          container_name: cassandra
          ports:
            - "9042:9042"
      
        mongo:
          image: mongo
          container_name: mongo
          ports:
            - "27017:27017"
     
    
  • Installing and Configuring Spark
     sudo apt install default-jdk
     wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
     tar xvf spark-3.5.0-bin-hadoop3.tgz
     sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
     # Open ~/.bashrc and add the two export lines below:
     nano ~/.bashrc
     export SPARK_HOME=/opt/spark
     export PATH=$PATH:$SPARK_HOME/bin
     # Reload the shell configuration and verify the installation:
     source ~/.bashrc
     spark-shell
    
    

II. Start Kafka

1. Access the Kafka Container

  • Code:
    docker exec -it kafka /bin/sh  
    

2. Create a Kafka Topic

  • Code:
    kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic user_profiles  
    

3. Verify the Topic Creation

  • Code:
    kafka-topics.sh --list --zookeeper zookeeper:2181
    

4. Run the Kafka Producer (Producer.py)

  • Code:
    python3 Producer.py
    
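The repository's Producer.py is not reproduced on this page. The sketch below shows one way such a producer could look, assuming the profiles come from the randomuser.me API and are published with the kafka-python client (both are assumptions, not the repository's actual code):

  • Code:
    # Hypothetical Producer.py sketch: fetch one user profile at a time from
    # randomuser.me and publish it as JSON to the "user_profiles" topic.
    # Requires: pip install kafka-python requests
    import json
    import time

    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    while True:
        # randomuser.me responses look like {"results": [{...profile...}], "info": {...}}
        profile = requests.get("https://randomuser.me/api/").json()["results"][0]
        producer.send("user_profiles", value=profile)
        producer.flush()
        time.sleep(1)  # throttle to roughly one profile per second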

5. Run the Spark Streaming Consumer (Consumer.py)

  • Code:
    spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 Consumer.py
    
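Consumer.py is where the parsedStreamDF used in the next section is built. A minimal sketch of that setup is shown below; the schema covers only the fields the later transformations touch and is an assumption about the actual script:

  • Code:
    # Hypothetical sketch of the start of Consumer.py: read JSON profiles from
    # Kafka and parse them into parsedStreamDF (used in section III).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("UserProfilesConsumer").getOrCreate()

    # Partial schema: only the fields the transformations below rely on.
    schema = StructType([
        StructField("name", StructType([
            StructField("title", StringType()),
            StructField("first", StringType()),
            StructField("last", StringType()),
        ])),
        StructField("dob", StructType([StructField("date", StringType())])),
        StructField("location", StructType([
            StructField("street", StructType([
                StructField("number", StringType()),
                StructField("name", StringType()),
            ])),
            StructField("city", StringType()),
            StructField("state", StringType()),
            StructField("country", StringType()),
            StructField("postcode", StringType()),
        ])),
    ])

    rawDF = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "user_profiles")
             .load())

    parsedStreamDF = (rawDF
                      .select(from_json(col("value").cast("string"), schema).alias("data"))
                      .select("data.*"))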

III. Transformations

The consumer derives three new columns from the parsed stream: full_name (title, first, and last name concatenated), calculated_age (derived from the date of birth), and complete_address (the location fields combined).

  • Code:
    parsedStreamDF = parsedStreamDF.withColumn("full_name", 
      concat_ws(" ", 
          col("name.title"), 
          col("name.first"), 
          col("name.last")
      )
    )
    parsedStreamDF = parsedStreamDF.withColumn("calculated_age", year(current_date()) - year(to_date(parsedStreamDF["dob.date"])))
    parsedStreamDF = parsedStreamDF.withColumn("complete_address", 
      concat_ws(", ", 
          col("location.street.number").cast("string"), 
          col("location.street.name"), 
          col("location.city"), 
          col("location.state"), 
          col("location.country"), 
          col("location.postcode").cast("string")
      )
    )
    
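One caveat: year(current_date()) - year(dob) overstates the age by one for anyone whose birthday has not yet occurred this year. If exact ages matter, a variant along these lines (a sketch, not the repository's code) is more accurate:

  • Code:
    from pyspark.sql.functions import col, current_date, datediff, floor, to_date

    # Age as whole years elapsed since birth (365.25 accounts for leap years).
    parsedStreamDF = parsedStreamDF.withColumn(
        "calculated_age",
        floor(datediff(current_date(), to_date(col("dob.date"))) / 365.25).cast("int"),
    )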

IV. Start Cassandra

Connect to the Cassandra cluster

  • Code:
    from cassandra.cluster import Cluster

    cluster = Cluster(['localhost'], port=9042)
    session = cluster.connect()
    

Define the keyspace name

  • Code:
    keyspace = "user_profiles"
    

Define the table name

  • Code:
    table_name = "users"
    

Create the keyspace if it doesn't exist and switch to it

  • Code:
    session.execute(f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH REPLICATION = {{'class': 'SimpleStrategy', 'replication_factor': 1}}")
    session.execute(f"USE {keyspace}")
    
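SimpleStrategy with a replication factor of 1 is appropriate for a single-node development setup like this one; a production cluster would normally use NetworkTopologyStrategy with a higher replication factor.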

Define the table with full_name as the primary key

  • Code:
    table_creation_query = f"""
        CREATE TABLE IF NOT EXISTS {table_name} (
            full_name TEXT PRIMARY KEY,
            calculated_age INT,
            complete_address TEXT
        )
    """
    session.execute(table_creation_query)
    
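Note that with full_name as the sole primary key, two users who share the same title, first name, and last name will overwrite each other; a production schema would normally add a unique identifier (such as the profile's login UUID) to the key.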

Select only the columns needed for Cassandra table

  • Code:
    cassandraDF = parsedStreamDF.select("full_name", "calculated_age", "complete_address")
    

Function to save DataFrame to Cassandra

  • Code:
    def save_to_cassandra(df, keyspace, table_name, checkpoint_location):
        # The checkpoint location belongs on the streaming writer, not on the
        # per-batch write; foreachBatch then appends each micro-batch to Cassandra.
        query = df.writeStream \
            .option("checkpointLocation", checkpoint_location) \
            .foreachBatch(lambda batch_df, batch_id: batch_df.write \
                          .format("org.apache.spark.sql.cassandra") \
                          .mode("append") \
                          .option("keyspace", keyspace) \
                          .option("table", table_name) \
                          .save())
        return query
    
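Note that the org.apache.spark.sql.cassandra format requires the Spark Cassandra Connector on the classpath. When submitting Consumer.py, add it to the --packages list next to the Kafka package, e.g. com.datastax.spark:spark-cassandra-connector_2.12:3.5.0 (check for the version matching your Spark build).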

Start the streaming query

  • Code:
    checkpoint_location = "./checkpoint/data"
    query = save_to_cassandra(cassandraDF, keyspace, table_name, checkpoint_location)
    query.start().awaitTermination()
    
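To confirm that records are landing in Cassandra, you can read a few rows back with the same Python driver (a quick sanity check, not part of the pipeline):

  • Code:
    # Print a handful of rows from the users table.
    from cassandra.cluster import Cluster

    cluster = Cluster(['localhost'], port=9042)
    session = cluster.connect('user_profiles')
    rows = session.execute('SELECT full_name, calculated_age, complete_address FROM users LIMIT 5')
    for row in rows:
        print(row.full_name, row.calculated_age, row.complete_address)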

V. GDPR Documentation

Data Processing Register

In accordance with the GDPR, a detailed register documenting all personal data processing activities is maintained as follows:

1. Data Types Stored: The system processes various personal data types sourced from the Kafka topic "user_profiles". These include fields such as gender, name, location, email, login, dob (date of birth), registered, phone, and nat (nationality).

2. Processing Purposes: The data undergoes several transformations, including:

   - Deriving full_name by concatenating title, first name, and last name.
   - Calculating calculated_age from the date of birth.
   - Generating complete_address by combining address-related fields.

3. Security Measures Implemented: The processed data is written to a Cassandra database to ensure durability and reliability. Data is stored in the keyspace "user_profiles" in the table "users", which is configured with appropriate access control and data security measures. The primary key full_name uniquely identifies each record.
