ammar-kashif / BDA_Assignment03


Streaming Data Insights: Frequent Itemset Analysis on Amazon Metadata

Dataset Description

The Amazon Metadata dataset is a collection of product information stored in JSON format. It includes attributes such as product ID, title, features, description, price, image URLs, related products, sales rank, brand, categories, and technical details.

Dataset Download

You can download the Amazon Metadata dataset from here.

Dependencies

  • Python Libraries:

    • Kafka-Python: pip install kafka-python
    • Pandas: pip install pandas
    • TQDM: pip install tqdm
    • json, re, itertools: part of the Python standard library.
  • Software:

    • Apache Kafka: powers the streaming pipeline; its components are initialized by the provided bash script.

Features

  • Sampling and Preprocessing the Dataset
  • Setting up Streaming Pipeline
  • Implementing Frequent Itemset Mining Algorithms
  • Integrating with Database
  • Bash Script for Enhanced Project Execution

How to Use

  1. Sampling and Preprocessing:

    • Download the Amazon Metadata dataset.
    • Execute preprocess.ipynb to sample and preprocess the dataset (a sampling sketch follows this list).
  2. Streaming Pipeline Setup:

    • Develop a producer application (producer.py) to stream the preprocessed data; a minimal producer sketch appears after this list.
    • Create consumer applications (apriori_consumer.py, pcy_consumer.py, custom_consumer.py) to subscribe to the producer's data stream.
  3. Frequent Itemset Mining:

    • Implement the Apriori algorithm in apriori_consumer.py.
    • Implement the PCY algorithm in pcy_consumer.py (a consumer-side PCY sketch follows this list).
    • Implement custom analysis in custom_consumer.py.
  4. Database Integration:

    • Connect each consumer to a database and store the results (a storage sketch follows this list).
  5. Bash Script:

    • Utilize the provided bash script to initialize Kafka components and run the producer and consumers seamlessly.
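
A minimal sampling sketch using the listed dependencies (pandas, tqdm). The input filename, sample size, and retained columns are assumptions; the actual logic lives in preprocess.ipynb:

```python
import json

import pandas as pd
from tqdm import tqdm

SAMPLE_SIZE = 100_000                   # assumed sample size
INPUT_PATH = "All_Amazon_Meta.json"     # hypothetical input filename

records = []
with open(INPUT_PATH) as f:
    # Assumes line-delimited JSON: one product record per line.
    for line in tqdm(f, desc="Sampling metadata"):
        records.append(json.loads(line))
        if len(records) >= SAMPLE_SIZE:
            break

df = pd.DataFrame(records)
# Keep only the fields needed for itemset mining (assumed schema) and
# drop products with no related-product information.
df = df[["asin", "title", "related"]].dropna(subset=["related"])
df.to_json("preprocessed.json", orient="records", lines=True)
```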
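A minimal producer sketch with kafka-python; the topic name, broker address, and input file are assumptions (the real values live in producer.py):

```python
import json

from kafka import KafkaProducer

TOPIC = "amazon-metadata"                   # hypothetical topic name

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",     # assumes a local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Stream the preprocessed records one at a time.
with open("preprocessed.json") as f:
    for line in f:
        producer.send(TOPIC, value=json.loads(line))

producer.flush()
```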
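A sketch of how a consumer such as pcy_consumer.py might apply PCY over a window of streamed baskets. The windowing policy, the basket definition (each product's related-product list), and the thresholds are assumptions, not the repository's actual parameters:

```python
import json
from collections import defaultdict
from itertools import combinations

from kafka import KafkaConsumer

NUM_BUCKETS = 10_007   # assumed hash-table size
SUPPORT = 5            # assumed minimum support
WINDOW = 1_000         # assumed baskets per window

def pcy(baskets):
    """Two-pass PCY over one window of baskets (lists of item IDs)."""
    item_counts = defaultdict(int)
    bucket_counts = [0] * NUM_BUCKETS

    # Pass 1: count single items and hash every pair into a bucket.
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_counts[item] += 1
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % NUM_BUCKETS] += 1

    frequent = {i for i, c in item_counts.items() if c >= SUPPORT}
    bitmap = [c >= SUPPORT for c in bucket_counts]

    # Pass 2: count only pairs of frequent items that fall in a
    # frequent bucket -- the PCY pruning step.
    pair_counts = defaultdict(int)
    for basket in baskets:
        items = sorted(set(basket) & frequent)
        for pair in combinations(items, 2):
            if bitmap[hash(pair) % NUM_BUCKETS]:
                pair_counts[pair] += 1

    return {p: c for p, c in pair_counts.items() if c >= SUPPORT}

consumer = KafkaConsumer(
    "amazon-metadata",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

window = []
for message in consumer:
    # Treat each product's related-product list as a basket (assumption).
    window.append(message.value.get("related", []))
    if len(window) >= WINDOW:
        print(pcy(window))
        window.clear()
```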
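The README does not name the database; a minimal storage sketch assuming MongoDB via pymongo (pip install pymongo), with hypothetical database and collection names:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")      # assumes a local MongoDB
collection = client["bda_assignment"]["pcy_results"]   # hypothetical names

def store_itemsets(window_id, itemsets):
    """Persist one window's frequent pairs, one document per pair."""
    collection.insert_many(
        [
            {"window": window_id, "items": list(pair), "support": count}
            for pair, count in itemsets.items()
        ]
    )
```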

Why Choose Our Solution

  • Efficient preprocessing techniques to handle large datasets.
  • Real-time streaming pipeline for immediate insights.
  • Implementation of popular frequent itemset mining algorithms.
  • Flexible database integration for data persistence.
  • Bash script automates project execution, enhancing usability.

Usage

  1. Clone the repository.
  2. Download the Amazon Metadata dataset.
  3. Execute the preprocessing script to sample and preprocess the dataset.
  4. Run the provided bash script to initialize Kafka components and execute the producer and consumers.
  5. Analyze the generated frequent itemsets and association rules.

Team

Meet the dedicated individuals who contributed to this project.

Languages

  • Python: 59.6%
  • Jupyter Notebook: 36.2%
  • Batchfile: 4.2%