Vamsi-Krishna-24 / Customer-Segmentation-Sentiment-Analysis-and-Predictive-Modelling.

This project carefully studies the customer reviews of an airline company. Around 10,000+ reviews were collected through web scraping, and sentiment analysis was used to capture the experience of the customers. Based on that, a machine learning model, a random forest classifier, was designed to predict which customers are likely to book seats.

Home Page: https://www.airlinequality.com/airline-reviews/british-airways


Important

In American politics, a party leveraged detailed data to categorize individual voters, tailoring content to sway opinions. My project draws inspiration from this, exploring how businesses apply a similar concept: customer segmentation. It involves sorting customer feedback to understand and strategically influence buying choices. Just as personalized messages impact voters in politics, businesses target customers based on their preferences, underscoring the crucial role of customer segmentation in shaping both political opinions and purchasing decisions.

🎯 Customer Segmentation

📊 1. A Detailed Brief

1.1 Sentiment Analysis and Predictive Analysis

For any business to run efficiently and successfully, a primary concern should be customer retention. In other words, it is the company's ability to turn customers into repeat buyers of its product or service. For any files related to data or code, have a look at the left panel.

Companies effectively using customer segmentation strategies experience a 36% increase in revenue (Source: Aberdeen Group).

1.2 🎯 Customer Segmentation

Within the vast array of reviews a company receives, we meticulously classify them into three distinct categories: positive, neutral, and negative. This segmentation enables us to effectively target positive, negative, and neutral customer segments with specific marketing strategies, maximizing the benefits for the company.

Customer Retention Strategy Customer Segmentation

Technologies Used

  • Python
  • AWS
  • Docker
  • S3 Buckets
  • Streamlit


2. Data Collection 📈

Nothing can be built from a void. Ever wondered why every supermarket, multiplex, airline, or any proper business asks for feedback? Is it just for improving the service? Absolutely not. So what else is it for? Let's dive deep and see...

The complete data used in this project is here.

There are two types of data:

  1. Customer Reviews Data:

    • The first dataset consists of customer reviews. These reviews, gathered as feedback on the company's website, were web-scraped using the Python library BeautifulSoup. For a more detailed overview, you can find the source code.
    • The actual customer feedback collected through scraping amounts to a total of 10,000+ records.
  2. Customer Booking Data:

    • The second type of data is the customer booking data, which includes details about flight time, price, duration, etc.
    • You can access the customer booking data here. This dataset comprises 50,000+ records with 7+ main metrics.

Web Scraping Diagram


import requests
from bs4 import BeautifulSoup
import pandas as pd

All the code for data collection is available here.
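The collection step above can be sketched as follows. This is a minimal illustration, not the project's actual scraper: it assumes each review body sits in a `<div class="text_content">` element, and it parses a small inline HTML sample so the sketch runs without network access (the real collector would fetch pages with `requests` first).

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_reviews(html: str) -> list[str]:
    """Extract review text from a page's HTML.

    Assumes review bodies sit in <div class="text_content"> elements;
    the real selector on airlinequality.com may differ.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True)
            for div in soup.find_all("div", class_="text_content")]

# In the actual collector, pages would be fetched first, e.g.:
# html = requests.get("https://www.airlinequality.com/airline-reviews/british-airways/page/1/").text

# Small inline sample so the sketch is self-contained:
sample_html = """
<div class="text_content">Great crew and comfortable seats.</div>
<div class="text_content">Flight was delayed and food was poor.</div>
"""
reviews = parse_reviews(sample_html)
df = pd.DataFrame({"Reviews": reviews})
print(len(df))  # 2
```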

3. DATA PRE-PROCESSING

The term "pre-processing" itself says that data is handled "pre" (before) being processed in the code. It has two main steps: cleaning and feature engineering.

3.1 Data Cleaning

The available or collected data is often in a raw state, characterized by duplicate entries, null values, and a lack of structure. This raw state can be compared to chopping vegetables before cooking. To prepare the data for analysis, it requires cleaning and transformation:
  • Removal of duplicate entries.
  • Addressing null values.
  • Changing the unstructured format to a structured one.
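The cleaning steps above can be sketched with pandas. This is a minimal illustration on toy data; the column names are hypothetical, not the project's actual schema.

```python
import pandas as pd

# Toy data exhibiting the problems described above: duplicates and nulls
raw = pd.DataFrame({
    "Reviews": ["Great flight", "Great flight", None, "Late departure"],
    "Rating": [5, 5, 3, None],
})

clean = (
    raw.drop_duplicates()           # removal of duplicate entries
       .dropna(subset=["Reviews"])  # drop rows with missing review text
       .reset_index(drop=True)      # restore a tidy, structured index
)
print(len(clean))  # 2
```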

3.2 Feature Engineering

From the data at hand, creating new metrics (columns), or features, lets the data be analyzed in the required form.

Just like preparing ingredients before cooking, this data-cleaning process is essential before diving into the analysis "recipe." The objective is a clean, organized dataset, ready for analysis, much like having ingredients neatly prepared before starting to cook a delicious meal. Here a new feature, SENTIMENT, is added after performing sentiment analysis, to study the percentage of customers who stay positive, negative, or neutral towards a service.


# Apply sentiment analysis and categorization to each comment
data['Sentiment'] = data['Reviews'].apply(lambda x: categorize_sentiment(analyse_sentiment(x)))
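The helpers `analyse_sentiment` and `categorize_sentiment` are not shown in this excerpt. A plausible sketch of the categorization step follows, assuming polarity scores in [-1, 1] (e.g. from TextBlob, which is listed in the requirements); the 0.05 cut-offs are illustrative, not the project's actual thresholds.

```python
import pandas as pd

def categorize_sentiment(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label.

    The 0.05 thresholds are illustrative assumptions.
    """
    if polarity > 0.05:
        return "Positive"
    if polarity < -0.05:
        return "Negative"
    return "Neutral"

# analyse_sentiment would typically compute polarity from review text,
# e.g. with TextBlob:  TextBlob(text).sentiment.polarity
polarities = pd.Series([0.8, -0.4, 0.0])
labels = polarities.apply(categorize_sentiment)
print(labels.tolist())  # ['Positive', 'Negative', 'Neutral']
```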

3.3 Understanding Different Possibilities: How the Data Can Be Used 🤔

Every time feedback is given, it is used to segment users so that each segment is dealt with accordingly by the sales and marketing team.

In-depth, understanding whether a customer will choose to engage with a company again hinges on the feedback they provide. Although the feedback process is inherently straightforward, its significance lies in our ability to predict a customer's likelihood of returning.

🐍 Python Requirements

 
pandas
numpy
textblob
scikit-learn
requests
beautifulsoup4
boto3
sagemaker
streamlit
matplotlib

The full requirements.txt file is available here.

4. In-detail DATA ANALYSIS and INSIGHTS 🔍📕

4.1 NLP and Sentiment Analysis: NLP and sentiment analysis play a crucial role in identifying customer satisfaction levels:

  • Satisfied Customers: 64%+
  • Unsatisfied Customers: 35%+
  • Neutral Responses: 11%
NLP and Sentiment Analysis

4.2 Keyword Identification: Identifying the most dominant factors customers look for:

  • Flight
  • Food
  • Service
  • Trip Verified
  • ...
Keyword Identification

4.3 Daily flight booking data

  • Weekly decline in bookings.
  • Bookings are highest on Monday and gradually decrease by about 5% each day.
  • After the fall in bookings from Monday to Saturday, there is a slight rise on Sunday.
Daily Flight booking data

5. End-to-end machine learning pipeline or workflow

5.1 S3 Bucket Creation and Data Upload 🪣

S3 buckets are like digital containers in the cloud that store various types of data securely. They act as virtual warehouses accessible from anywhere on the internet, offering features like versioning, access control, and the ability to host static websites. In simple terms, it is like photos on our phones being synced to Google Photos: the storage is managed for us and can be accessed from anywhere. The same logic applies here.

S3 Buckets

  
import boto3

# Bucket and file paths
bucket_name = 'customersegmentation60k'
customer_booking_path = 'sagemaker/customersegmentation/sklearnconatiner/customer_booking.csv'
sentiment_analysis_path = 'sagemaker/customersegmentation/sklearnconatiner/Polarity.csv'

# Creating an S3 client
s3 = boto3.client('s3')

# Upload required data
s3.upload_file('Vk/downloads/customer_booking.csv', bucket_name, customer_booking_path)
s3.upload_file('Vk/downloads/Polarity.csv', bucket_name, sentiment_analysis_path)

5.2 Data Access through SageMaker:

In the machine learning workflow, data stored in an S3 bucket is seamlessly accessed through SageMaker, where a Jupyter Notebook is employed for developing and fine-tuning the machine learning model. For instance, consider a scenario where a dataset of housing prices (stored in S3) is analyzed and a predictive model is trained using SageMaker's Jupyter environment for housing price predictions.

  
# Importing all the necessary libraries.
import boto3
import sagemaker
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sagemaker import get_execution_role
from sagemaker.model import Model
import matplotlib.pyplot as plt
  

5.3 Machine Learning Code Development in Jupyter Notebook:

The data for the ML workflow lives in the S3 bucket and is accessed from SageMaker's Jupyter environment, where the model code is developed and refined, just as in the housing-price example above.

5.4 Deployment of SageMaker Endpoints:

First, in Jupyter Notebook 1, we develop machine learning code on Amazon SageMaker using customer segmentation data. This code becomes a model deployed as an endpoint.

Now, in Jupyter Notebook 2, we use Streamlit. It acts as an interface for users to input data. Streamlit takes care of collecting user input and generates requests. These requests are then forwarded to the SageMaker endpoint, where the model processes them to make predictions. In essence, Notebook 1 builds the model, and Notebook 2 handles user interactions and facilitates predictions using that model.

  
  predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large", endpoint_name="your-endpoint-name")
  
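The request flow from Notebook 2 (Streamlit) to the endpoint might look like the following sketch. The endpoint name, payload format, and feature order are assumptions, and the network call itself is shown only as a comment.

```python
def build_payload(purchase_lead: int, length_of_stay: int, flight_duration: float) -> str:
    """Serialize user inputs as a single CSV row, the format a CSV-serving
    endpoint expects. Feature names follow the metrics listed in this README;
    their order is assumed.
    """
    return f"{purchase_lead},{length_of_stay},{flight_duration}"

# In the Streamlit app, the payload would be sent to the deployed endpoint, e.g.:
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="your-endpoint-name",
#     ContentType="text/csv",
#     Body=build_payload(30, 5, 8.5),
# )
# prediction = response["Body"].read().decode()

print(build_payload(30, 5, 8.5))  # 30,5,8.5
```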

6. 🛠️ Building a Predictive Model

The core of this project lies in building a model, and that's here....

Additional data is collected, namely booking data (purchase data). Along with the web-scraped data, this is used to build a predictive machine learning model. In this case, the model used is a random forest classifier, which can predict whether or not a given customer, with given metrics, will buy the company's product or service.

  
# Create a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=1)
  

It's not just about dumping data into a model; it is about selecting the right model, telling the story behind the data, and predicting something that makes sense in a practical way!

Random Forest Classifier Usage:

The Random Forest classifier is utilized by inputting data from the customer booking data matrix. This matrix incorporates features like flight time, hours traveled, and other relevant metrics. The primary goal is to predict whether the customer will make a repeat purchase from the company.

Model Performance:

The model's predictive performance is assessed using the precision score, achieving 80%. This signifies a commendable level of precision in predicting positive instances, indicating a successful outcome.

Random Forest Classifier Model Predictive Analytics

A few important metrics taken as input for a good prediction include 'purchase_lead', 'length_of_stay', and 'flight_duration', used to predict the booking outcome. The model achieved a precision score of 80%.
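The training and evaluation step can be sketched end to end as follows. This runs on synthetic stand-ins for the three features named above (the real model is trained on the 50,000-row booking dataset), so the score here is illustrative, not the project's 80%.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic stand-ins for purchase_lead, length_of_stay, flight_duration
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 2] > 0).astype(int)  # toy "will book" label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Create and fit a Random Forest classifier, as in the snippet above
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)

# Evaluate with the precision score, the metric this README reports
precision = precision_score(y_test, clf.predict(X_test))
print(0.0 <= precision <= 1.0)  # True
```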


Conclusion

After meticulously collecting and analyzing customer feedback, it's evident that customers fall into two distinct types: those who actively provide feedback (Type 1) and those who do not (Type 2). The essence of this project revolves around understanding customer behavior and satisfaction levels.

The primary objective is to segment customers based on their satisfaction levels. Each segment is then channeled into specific marketing strategies, ensuring a tailored approach to maximize the chances of customers returning for the same product or service. By studying and responding to customer feedback, businesses can enhance customer satisfaction and loyalty, ultimately contributing to their overall success.



Languages

Python 99.2%, Jupyter Notebook 0.7%, PowerShell 0.1%, Shell 0.0%