Cluster Modelling
This repository applies K-Means and Agglomerative Clustering to a dataset of customer transactions, tunes the number of clusters with Optuna, and visualizes the results.
The goal is to identify distinct groups of customers based on their spending behavior. The workflow has the following steps:
- Data Preprocessing
- Clustering using K-Means and Agglomerative Clustering
- Optimization of the number of clusters using Optuna
- Evaluation and comparison of clustering models
- Visualization of clustering results
The dataset used in this project is cluster 2.csv. It contains customer transaction data with features such as balance, purchases, cash advance, credit limit, and payments.
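The preprocessing step is not spelled out in this README, so the following is only a minimal sketch of what it might look like. It assumes that non-numeric columns are dropped, that missing values are filled with the column median, and that features are standardized with StandardScaler. The function name `preprocess` and the demo column names are hypothetical; `numeric_df` and `scaled_df` are the names used by the code later in this document.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (numeric_df, scaled_df) built from a raw transactions frame."""
    # Keep numeric columns only; ID-like string columns are dropped.
    numeric_df = df.select_dtypes(include="number").copy()
    # Assumption: missing values are filled with the column median.
    numeric_df = numeric_df.fillna(numeric_df.median())
    # Standardize so that large-valued features do not dominate distances.
    scaled = StandardScaler().fit_transform(numeric_df)
    scaled_df = pd.DataFrame(scaled, columns=numeric_df.columns, index=numeric_df.index)
    return numeric_df, scaled_df


# Tiny synthetic frame standing in for the real CSV (column names are illustrative).
demo = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C3", "C4"],
    "BALANCE": [100.0, 2500.0, None, 40.0],
    "PURCHASES": [20.0, 0.0, 310.0, 95.0],
})
numeric_df, scaled_df = preprocess(demo)
```

With the real data, the call would presumably be `preprocess(pd.read_csv("cluster 2.csv"))`.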
To run this project, you need the following libraries:
- pandas
- numpy
- scikit-learn
- seaborn
- matplotlib
- optuna
You can install the required libraries using the following command:
pip install -r requirements.txt
To run the clustering analysis, follow these steps:
- Clone the repository:
git clone https://github.com/Ismat-Samadov/cluster_modelling.git
- Navigate to the project directory:
cd cluster_modelling
- Ensure you have the necessary libraries installed:
pip install -r requirements.txt
- Run the Jupyter notebook:
jupyter notebook Cluster_Modelling.ipynb
The repository contains the following files:
- Cluster_Modelling.ipynb: Jupyter notebook containing the clustering analysis and optimization code.
- cluster 2.csv: Dataset used for clustering analysis.
- requirements.txt: List of required libraries.
We optimize the number of clusters for both K-Means and Agglomerative Clustering using Optuna. The optimization aims to maximize the silhouette score, which measures how similar an object is to its own cluster compared to other clusters.
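To make the silhouette score concrete, here is a small NumPy-only sketch of the standard definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the point's score is (b - a) / max(a, b). The function name `silhouette_manual` is hypothetical and is meant for illustration only; the notebook itself uses `silhouette_score` from scikit-learn.

```python
import numpy as np


def silhouette_manual(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette coefficient with Euclidean distance (O(n^2) memory)."""
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster.
        a = dists[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean() for c in unique if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_manual(X, np.array([0, 0, 1, 1]))  # tight, well-separated clusters
bad = silhouette_manual(X, np.array([0, 1, 0, 1]))   # each cluster mixes both groups
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores indicate points that sit closer to another cluster than to their own.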
The optimization functions are defined as follows:
import optuna
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

# numeric_df is assumed to be the numeric feature DataFrame prepared during preprocessing.

def optimize_kmeans(trial):
    # Search the number of clusters between 2 and 10 (inclusive).
    n_clusters = trial.suggest_int('n_clusters', 2, 10)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans_labels = kmeans.fit_predict(numeric_df)
    return silhouette_score(numeric_df, kmeans_labels)

def optimize_agg(trial):
    # Same search range for Agglomerative Clustering.
    n_clusters = trial.suggest_int('n_clusters', 2, 10)
    agg = AgglomerativeClustering(n_clusters=n_clusters)
    agg_labels = agg.fit_predict(numeric_df)
    return silhouette_score(numeric_df, agg_labels)

# Optimize K-Means
kmeans_study = optuna.create_study(direction='maximize')
kmeans_study.optimize(optimize_kmeans, n_trials=20)
best_kmeans_clusters = kmeans_study.best_params['n_clusters']

# Optimize Agglomerative Clustering
agg_study = optuna.create_study(direction='maximize')
agg_study.optimize(optimize_agg, n_trials=20)
best_agg_clusters = agg_study.best_params['n_clusters']
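Because the search space holds only nine candidate values (2 through 10), an exhaustive loop is a cheap way to cross-check the Optuna result; with 20 sampled trials, some values may be evaluated more than once and others not at all. The sketch below is not part of the notebook: `sweep_kmeans` is a hypothetical helper, and the synthetic blobs merely stand in for `numeric_df`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def sweep_kmeans(X, k_min=2, k_max=10):
    """Return ({k: silhouette}, best_k) over every k in [k_min, k_max]."""
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # Pick the k with the highest silhouette score.
    best_k = max(scores, key=scores.get)
    return scores, best_k


# Synthetic stand-in for numeric_df: three well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)
scores, best_k = sweep_kmeans(X)
```

The same loop works for Agglomerative Clustering by swapping the estimator.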
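We evaluate the models using the silhouette score. The results are compiled into a DataFrame that compares the default and optimized scores for each model. Note that the default scores are computed on scaled_df, while the optimization functions score numeric_df; the two columns are directly comparable only if both refer to the same feature matrix.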
results = pd.DataFrame({
    'Model': ['K-Means', 'Agglomerative'],
    'Default Silhouette Score': [
        silhouette_score(scaled_df, kmeans_labels),
        silhouette_score(scaled_df, agg_labels)
    ],
    'Optimized Silhouette Score': [
        kmeans_study.best_value,
        agg_study.best_value
    ]
})
The optimized number of clusters for each model is printed:
print(f"Best number of clusters for K-Means: {best_kmeans_clusters}")
print(f"Best number of clusters for Agglomerative Clustering: {best_agg_clusters}")
The clustering results are then visualized using pair plots:
sns.pairplot(cluster_data, hue='Best_KMeans_Labels')
plt.show()
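The pair plot expects a `cluster_data` frame with a `Best_KMeans_Labels` column, but this README does not show how that frame is built. One plausible construction, assuming the labels come from refitting K-Means with the optimized number of clusters, is sketched here. The helper name `attach_kmeans_labels` and the synthetic feature frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def attach_kmeans_labels(features: pd.DataFrame, n_clusters: int) -> pd.DataFrame:
    """Return a copy of `features` with a 'Best_KMeans_Labels' column added."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    cluster_data = features.copy()
    # Cast to string so seaborn treats the labels as categories, not a numeric scale.
    cluster_data["Best_KMeans_Labels"] = model.fit_predict(features).astype(str)
    return cluster_data


# Synthetic stand-in for numeric_df (two obvious groups).
rng = np.random.default_rng(0)
features = pd.DataFrame(
    np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))]),
    columns=["BALANCE", "PURCHASES"],
)
cluster_data = attach_kmeans_labels(features, n_clusters=2)
```

In the notebook, this would presumably be called as `attach_kmeans_labels(numeric_df, best_kmeans_clusters)` before the pair plot is drawn.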
If you would like to contribute to this project, please create a pull request with detailed information about the changes.
For any questions or inquiries, please contact Ismat Samadov.