jf20541 / KMeansDbscanPCA

Two unsupervised clustering models (KMeans and DBSCAN) with PCA for dimensionality reduction. Techniques are applied to find the optimal hyperparameters, and the outputs are visualized.

KMeansDbscanPCA

Objective

An unsupervised clustering project that applies KMeans and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) after reducing the dimensionality with Principal Component Analysis (PCA). The goal is to find US counties with similar characteristics, for predatory marketing, government campaigns, business development, etc., by examining 34 attributes.

Repository File Structure

├── src          
│   ├── kmean.py             # Optimal K-cluster for the population of counties based on the selected 7 PCA attributes 
│   ├── dbscan.py            # Optimal epsilon and minPoint value with the selected 7 PCA attributes
│   ├── pca.py               # Dimensionality reduction, plotting, and deploying the PCA model
│   ├── data.py              # Cleaned the data
│   └── config.py            # Define path as global variable
├── inputs
│   ├── train.csv            # Training dataset
│   └── population_seg.csv   # Cleaned data
├── notebooks            
│   └── population_seg.ipynb # Exploratory Data Analysis, Visualization and Feature Engineering 
├── plots
│   ├── DBSCAN_PCA.png       # Frequency clustering DBSCAN & PCA
│   ├── KMeansPCA2.png       # Frequency clustering KMeans & PCA
│   ├── Kmeans_Elbow.png     # Optimal K using Elbow Method
│   ├── optimial_epsilon.png # Optimal Epsilon using KNN
│   ├── pca_explained_bar.png  # Explained Variance
│   └── pca_explained_var.png  # N-Components for Explained Variance
├── requierments.txt         # Packages used for project
└── README.md

Model and Visualization

KMeans

K-Means partitions the data into K clusters by repeatedly assigning each data point to its nearest centroid and then recomputing the centroids. The number of clusters, K, is chosen with the Elbow Method. K-Means is sensitive to outliers, and it scales poorly as the number of dimensions increases.

  • n_clusters: The number of clusters (K). The optimal K-value is found by plotting and applying the Elbow Method
  • max_iter: Maximum number of iterations of the k-means algorithm for a single run
  • n_init: Number of times the k-means algorithm will be run with different centroid seeds

Finding Optimal K (Elbow Method) Plot
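
The elbow search described above can be sketched as follows. This is a minimal illustration assuming scikit-learn, not the code in kmean.py: the synthetic data from make_blobs (standing in for the 7 PCA components), the helper name elbow_inertias, and the K range of 1 to 10 are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-reduced county data (assumption: 7 components).
X, _ = make_blobs(n_samples=300, centers=4, n_features=7, random_state=0)


def elbow_inertias(X, k_values, max_iter=300, n_init=10, random_state=0):
    """Fit KMeans once per K and collect the within-cluster sum of squares."""
    inertias = []
    for k in k_values:
        model = KMeans(
            n_clusters=k,
            max_iter=max_iter,
            n_init=n_init,
            random_state=random_state,
        )
        model.fit(X)
        inertias.append(model.inertia_)
    return inertias


k_values = list(range(1, 11))  # hypothetical search range
inertias = elbow_inertias(X, k_values)

for k, inertia in zip(k_values, inertias):
    print(f"K={k:2d}  inertia={inertia:.1f}")
```

Plotting inertias against k_values gives the elbow curve; the K where the curve stops dropping sharply is the usual candidate for n_clusters.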

DBSCAN

An unsupervised, density-based clustering algorithm that identifies clusters as regions of high point density and natively flags points in low-density regions as outliers (noise). The model has two hyperparameters, Epsilon and Minimum Points. Epsilon is the radius of the neighborhood around any point. Minimum Points is the minimum number of points that must fall within the Epsilon radius for a point to count as a core point.

  • eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other. The optimal Epsilon value was found using a K-Nearest Neighbor distance plot
  • min_samples: The number of samples in a neighborhood for a point to be considered as a core point

Finding Optimal Epsilon using K-Nearest Neighbor Plot
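
One common way to build such a K-Nearest Neighbor plot is a sorted k-distance curve, sketched below with scikit-learn. This is an illustration under stated assumptions rather than the code in dbscan.py: the data is synthetic, min_samples = 14 is a hypothetical value, and the 90th-percentile rule is only a crude stand-in for reading the knee off the plot by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the PCA-reduced county data (assumption: 7 components).
X, _ = make_blobs(n_samples=300, centers=3, n_features=7, random_state=0)

min_samples = 14  # hypothetical value, not taken from the repository

# Distance from every point to its min_samples-th nearest neighbor.
# Querying the training data itself counts each point as its own first
# neighbor, which matches how DBSCAN counts min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])  # sorted curve to plot and inspect

# Crude stand-in for picking the knee of the curve by eye (assumption).
eps = float(np.percentile(k_dist, 90))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))  # DBSCAN labels noise points as -1

print(f"eps={eps:.3f}  clusters={n_clusters}  noise points={n_noise}")
```

Plotting k_dist shows a flat region followed by a sharp rise; an Epsilon near the start of that rise keeps dense regions together while leaving sparse points as noise.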

Principal Component Analysis (PCA)

A method for reducing the dimensionality of the dataset, which has shape [3220, 34]. Keeping all 34 features increases processing time and noise. The explained variance ratio (the percentage of variance explained by each of the selected components) is used to select the number of principal components.

  • n_components: Number of principal components, selected based on 80% explained variance

PCA Explained Variance Ratio for N-Components Plot
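
The 80% rule can be sketched as picking the smallest number of components whose cumulative explained variance ratio reaches 0.80. The example below assumes scikit-learn and is not the code in pca.py: the data is random with the same [3220, 34] shape rather than the census data, and standardizing the features first is an assumption about the preprocessing.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Random stand-in with the README's shape [3220, 34]: a few latent factors
# plus noise, so that some components explain much more variance than others.
latent = rng.normal(size=(3220, 5))
X = latent @ rng.normal(size=(5, 34)) + 0.5 * rng.normal(size=(3220, 34))

# Assumption: features are standardized before PCA.
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components, then accumulate the explained variance ratio.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 80%.
target = 0.80
n_components = int(np.searchsorted(cumulative, target) + 1)

X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
print(f"n_components={n_components}  reduced shape={X_reduced.shape}")
```

The cumulative array is also what an explained-variance plot would show on its vertical axis, with the component count on the horizontal axis.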

Data

Population Segmentation Data and US Census Data

'TotalPop', 'Men', 'Women', 'Hispanic', 'White', 'Black', 'Native',
'Asian', 'Pacific', 'Citizen', 'Income', 'IncomeErr', 'IncomePerCap',
'IncomePerCapErr', 'Poverty', 'ChildPoverty', 'Professional', 'Service',
'Office', 'Construction', 'Production', 'Drive', 'Carpool', 'Transit',
'Walk', 'OtherTransp', 'WorkAtHome', 'MeanCommute', 'Employed',
'PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork', 'Unemployment'

License: MIT License


Languages

Jupyter Notebook 99.0%, Python 1.0%