Myers-Briggs-Type-Indicator-MBTI-classification-Web-App

Web Application

Overview
- The MBTI Types
Data Collection
- Reddit Data Collection Using Pushshift Reddit API Code Link
Dataset
- Content
Data cleaning and preprocessing
Exploratory Data Analysis
- PROBLEM ENCOUNTERED DURING EDA
Feature engineering
- PROBLEM ENCOUNTERED DURING FEATURE ENGINEERING
Model Building and Evaluation
Model performance
Future Improvements
User interface
Productionization
Technologies
References

Overview

During the sophomore year of my bachelor's degree, a friend I met on Reddit introduced me to the book "Gifts Differing: Understanding Personality Type" by Isabel Briggs Myers and Peter B. Myers. "This book distinguishes four categories of personality styles and shows how these qualities determine the way you perceive the world and come to conclusions about what you've seen."
Later that same year, I came across a self-report questionnaire by the same author, the "Myers–Briggs Type Indicator (MBTI)", designed to identify a person's personality type, strengths, and preferences. Based on it, people are identified as having one of 16 personality types.

The MBTI Types

ISTJ - The Inspector
ISTP - The Crafter
ISFJ - The Protector
ISFP - The Artist
INFJ - The Advocate
INFP - The Mediator
INTJ - The Architect
INTP - The Thinker
ESTP - The Persuader
ESTJ - The Director
ESFP - The Performer
ESFJ - The Caregiver
ENFP - The Champion
ENFJ - The Giver
ENTP - The Debater
ENTJ - The Commander

Around the same time, I became interested in machine learning and data science. One of the things that drew me to ML was the fact that most dating applications don't use machine learning to match people. This article explains how Tinder matched people for a long time; let me quote some of it here:

"A few years ago, Tinder let Fast Company reporter Austin Carr look at his “secret internal Tinder rating,” and vaguely explained to him how the system worked. Essentially, the app used an Elo rating system, which is the same method used to calculate the skill levels of chess players: You rose in the ranks based on how many people swiped right on (“liked”) you, but that was weighted based on who the swiper was. The more right swipes that person had, the more their right swipe on you meant for your score. Tinder would then serve people with similar scores to each other more often, assuming that people whom the crowd had similar opinions of would be in approximately the same tier of what they called “desirability.” (Tinder hasn’t revealed the intricacies of its points system, but in chess, a newbie usually has a score of around 800 and a top-tier expert has anything from 2,400 up.) (Also, Tinder declined to comment for this story.) "

Influenced by all of this, I came up with the idea of Myers–Briggs Type Indicator (MBTI) classification: a classifier that predicts your personality type, following Isabel Briggs Myers' Myers–Briggs Type Indicator (MBTI). The classification result can then be used to match people with the most compatible personality types.

Data Collection

One of the most difficult challenges for me was identifying what kind of data to collect for classifying Myers–Briggs personality types. During my final-year research project at my university, I collected data from Reddit, specifically posts from mental health communities. By learning from the posts written by users, my proposed model could accurately identify whether a user's post belonged to a specific mental disorder. I used similar reasoning in this project. To my surprise, there are subreddits for all 16 personality types on Reddit, some with as many as 133K members, though others have only a few thousand. I collected data from all 16 of these subreddits using the Pushshift Reddit API.

Reddit Data Collection Using Pushshift Reddit API Code Link
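
The collection loop can be sketched roughly as follows. This is a minimal illustration, not the project's linked script: the endpoint URL and the `subreddit`, `size`, and `before` parameters are assumptions based on the historic public Pushshift API, the code assumes each page arrives newest first, and `fetch_page` is a hypothetical caller-supplied function that performs the HTTP request and returns the decoded list of posts.

```python
from urllib.parse import urlencode

# Assumed endpoint of the historic public Pushshift API; availability and
# access rules have changed over time, so treat it as illustrative only.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_query(subreddit, before=None, size=100):
    """Build a search URL for one page of submissions from a subreddit."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        params["before"] = before  # only posts older than this epoch time
    return PUSHSHIFT_URL + "?" + urlencode(params)


def collect_posts(subreddit, fetch_page, limit=1000):
    """Page backwards in time until `limit` posts are collected.

    `fetch_page(url)` is a hypothetical function supplied by the caller; it
    must return a list of post dictionaries, assumed to be newest first.
    """
    rows, before = [], None
    while len(rows) < limit:
        page = fetch_page(build_query(subreddit, before))
        if not page:
            break  # no older posts left
        for post in page:
            rows.append({
                "Subreddit": subreddit,
                "Body": post.get("selftext", ""),
                "Date": post.get("created_utc"),
            })
        before = page[-1]["created_utc"]  # continue from the oldest post seen
    return rows[:limit]
```

Passing the fetch function in keeps the paging logic separate from the network call, so it can be exercised with canned pages before any real request is made.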

Dataset

Subreddit	Number of subscribers	Number of posts collected
ISTJ	12K	2,600
INFJ	101K	10,000
INTJ	108K	6,400
ENFJ	18.9K	6,600
ISTP	19.3K	9,200
ESFJ	4K	800
INFP	133K	8,600
ESTP	5K	830
ENFP	68K	1,200
ESTP	5K	1,700
ESTJ	2.8K	700
ENTJ	20K	9,000
INTP	121K	12,000
ISFJ	12K	4,400
ENTP	44K	7,600
ISFP	16K	4,100

Content

The following data was collected into a total of 16 CSV files. During data cleaning and preprocessing, these 16 files were concatenated into a single final CSV file.

Subreddit	Body	Date
Subreddit name of post	Text of post	Posting date
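
Merging the per-subreddit files could look something like the sketch below. It uses only the standard `csv` module and assumes every file has the three columns listed above; the function name and the file paths passed to it are hypothetical.

```python
import csv

COLUMNS = ["Subreddit", "Body", "Date"]


def concatenate_csv(paths, out_path):
    """Append the rows of every input CSV to one output CSV; return the row count."""
    total = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=COLUMNS)
        writer.writeheader()  # write the header once, not once per input file
        for path in paths:
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):
                    writer.writerow({col: row.get(col, "") for col in COLUMNS})
                    total += 1
    return total
```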

Data cleaning and preprocessing

Data cleaning and preprocessing included the following

Removing rows with Links in Body feature
Removing rows with Emojis in Body feature
Removing rows with HTML elements in the Body feature
Removing rows with punctuation in the Body feature
Removing rows with stopwords in the Body feature
Removing rows with [removed] in Body feature
Removing rows with [deleted] in Body feature
Removing rows with just numbers in the Body feature
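
A rough sketch of these steps is shown below. It rests on one interpretation that the project does not confirm: links, emojis, HTML, punctuation, and stopwords are stripped out of the text, while posts that are `[removed]`, `[deleted]`, or only numbers are dropped entirely. The list above could also be read as dropping whole rows in every case. The stopword set and the emoji ranges are small illustrative assumptions, not the lists the project used.

```python
import html
import re
import string

# Tiny illustrative stopword set (assumption); NLTK's English list is far larger.
STOPWORDS = {"a", "am", "an", "and", "are", "i", "in", "is", "it", "of",
             "the", "this", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Rough emoji ranges (assumption): pictographs, emoticons, symbols, dingbats.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_body(text):
    """Return the cleaned text, or None when the row should be dropped."""
    if text.strip() in ("[removed]", "[deleted]"):
        return None
    text = html.unescape(text)          # turn entities such as &lt; into characters
    text = URL_RE.sub(" ", text)        # strip links
    text = TAG_RE.sub(" ", text)        # strip HTML elements
    text = EMOJI_RE.sub(" ", text)      # strip emojis
    text = text.translate(PUNCT_TABLE).lower()
    words = [w for w in text.split() if w not in STOPWORDS]
    if not words or all(w.isdigit() for w in words):
        return None                     # nothing left, or just numbers
    return " ".join(words)
```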

Exploratory Data Analysis

Exploratory Data Analysis included the following

Class Imbalance check
N-gram Analysis
Generating WordClouds
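
The first two checks can be approximated with the standard library alone. The helpers below are a minimal sketch with hypothetical names; they assume the labels are the subreddit names and that the texts have already been cleaned and lowercased.

```python
from collections import Counter


def class_counts(labels):
    """Number of posts per class, most common first."""
    return Counter(labels).most_common()


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


def top_ngrams(texts, n=2, k=10):
    """The k most frequent word n-grams across all texts."""
    counter = Counter()
    for text in texts:
        words = text.split()
        # zip the word list against shifted copies of itself to form n-grams
        counter.update(zip(*(words[i:] for i in range(n))))
    return counter.most_common(k)
```

A ratio well above 1 is a quick signal that some form of rebalancing may be worth considering before training.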

PROBLEM ENCOUNTERED DURING EDA

During data collection, I noticed that some subreddits had few posts: my code collected only a small amount of data for the ESTJ, ESTP, ESFP, ESFJ, ISTJ, and ISFJ subreddits. As a result, I observed class imbalance during EDA.

A widely used way to address class imbalance is an oversampling technique called SMOTE (Synthetic Minority Oversampling Technique), so I used SMOTE to address the class imbalance in this problem.
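
The core idea of SMOTE is to create new minority-class points by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified pure-Python illustration of that idea, not the implementation used in this project (a library such as imbalanced-learn provides a full one). It picks the base point at random rather than looping over every minority sample, and the function name and parameters are hypothetical.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies in feature space.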

Feature engineering

For Multinomial Logistic Regression, I used Bag of words and TF-IDF features of each Reddit post.
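
To make the two feature types concrete, here is a small pure-Python sketch. It uses the plain textbook weighting idf = ln(N / df); library implementations such as scikit-learn's use a smoothed and normalised variant, so their numbers will differ. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({w for words in tokenised for w in words})
    vectors = []
    for words in tokenised:
        counts = Counter(words)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight each count by idf = ln(N / df), the plain textbook variant."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # document frequency: in how many documents each word appears
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[c * w for c, w in zip(vec, idf)] for vec in counts]
```

A word that appears in every document gets an idf of zero under this weighting, which is why TF-IDF tends to downplay words shared by all classes.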

PROBLEM ENCOUNTERED DURING FEATURE ENGINEERING

To visualize my high-dimensional embeddings, I reduced the TF-IDF and Bag of words features to two dimensions using Truncated SVD and plotted the resulting 2D embeddings. The classes were not linearly separable in 2D, which suggested that linear models such as SVM and Logistic Regression would not perform well. That was the rationale for using an RNN architecture with LSTM in this project.
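
The projection to two dimensions can be illustrated without any libraries. The sketch below approximates the two leading right singular vectors of the feature matrix by power iteration and then projects each row onto them. This swaps in power iteration for the library solver the project presumably used, it assumes a dense list-of-lists matrix with rank of at least two, and the function names are hypothetical. As with Truncated SVD, the data is not centred first.

```python
import math
import random


def _matvec(X, v):
    """Compute X v."""
    return [sum(x * w for x, w in zip(row, v)) for row in X]


def _rmatvec(X, u):
    """Compute X^T u."""
    return [sum(row[j] * u[i] for i, row in enumerate(X))
            for j in range(len(X[0]))]


def top_components(X, n_components=2, n_iter=200, seed=0):
    """Approximate the leading right singular vectors of X by power iteration."""
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in X[0]]
        for _ in range(n_iter):
            v = _rmatvec(X, _matvec(X, v))   # one step of (X^T X) v
            for c in components:             # stay orthogonal to earlier components
                dot = sum(a * b for a, b in zip(v, c))
                v = [a - dot * b for a, b in zip(v, c)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        components.append(v)
    return components


def project(X, components):
    """Coordinates of each row of X along each component."""
    return [[sum(x * w for x, w in zip(row, c)) for c in components] for row in X]
```

The resulting 2D points can then be passed to any plotting tool, coloured by class label.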

Model Building and Evaluation

For this project, I trained three models

Multinomial Logistic Regression with Bag of words features: Logistic regression, by default, is limited to two-class classification problems. Extensions such as one-vs-rest allow it to be used for multi-class problems, but they require the problem to first be transformed into multiple binary classification problems. Multinomial logistic regression instead extends the model to support multi-class classification natively, by changing the loss function to cross-entropy loss and the predicted probability distribution to a multinomial probability distribution.
Multinomial Logistic Regression with TF-IDF features: The same multinomial logistic regression model as above, trained on TF-IDF features instead of Bag of words features.
Recurrent Neural Network with LSTM: Feed-forward neural networks have no memory of the inputs they receive and are bad at predicting what's coming next. Because a feed-forward network only considers the current input, it has no notion of order in time; it can't remember anything about the past except what it learned in training. In an RNN, the information cycles through a loop: when it makes a decision, it considers the current input along with what it has learned from the inputs it received previously. A long short-term memory (LSTM) network is a type of RNN that mitigates the vanishing gradient problem by using a gated cell state, with input, forget, and output gates controlling what is kept and what is discarded.
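
The multinomial probability distribution and cross-entropy loss mentioned above can be written out in a few lines. This is a minimal sketch of the prediction and loss for a single example, not a training routine; the weights, bias, and function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn one score per class into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, bias, features):
    """Class probabilities; `weights` holds one row of coefficients per class."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)


def cross_entropy(probs, true_class):
    """Loss for one example: negative log-probability of the correct class."""
    return -math.log(probs[true_class])
```

With 16 personality types, `weights` would have 16 rows, and the predicted type is the index of the largest probability.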

Model performance

Algorithm	Accuracy	Recall	Precision	F1
Multinomial Logistic Regression with Bag of words features	45.17%	0.48	0.47	0.45
Multinomial Logistic Regression with TF-IDF features	50.20%	0.55	0.58	0.56
Recurrent Neural Networks with LSTM	95.33%	0.70	0.69	0.69
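
For reference, the four columns can be computed from true and predicted labels as sketched below. The averaging method behind the table is not stated, so macro averaging (an unweighted mean over classes) is an assumption here, and the function name is hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Under class imbalance, accuracy and macro-averaged recall can diverge, because accuracy is dominated by the larger classes while the macro average weights every class equally.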

The train and test accuracy and loss plots over epochs show that the model started to overfit after 8 epochs, so the final model was trained for 8 epochs.
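
Picking the stopping point from a loss curve can also be automated. The helper below is a minimal sketch of patience-based early stopping over a list of per-epoch validation losses; the function name and the `patience` parameter are hypothetical, and in this project the epoch count was read off the plots rather than chosen by code like this.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss, stopping once
    the loss has failed to improve for `patience` consecutive epochs."""
    best, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stopped improving
    return best
```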

Future Improvements

The data collected for this problem is not representative enough, especially for some classes where only a few hundred posts were collected. I ran a learning curve analysis for eight different dataset sizes, and the results confirmed a gap between training and test scores, pointing to a high-variance problem. If more posts can be collected in the future, the resulting dataset should improve the performance of these models.

User interface

Built with HTML, CSS, and JavaScript.

Productionization

Deployed the model to production using Flask.

Technologies

Python
Scikit-learn for model building
Matplotlib & Seaborn for data visualization
NLTK
TensorFlow
SMOTE
Flask for HTTP server
HTML/CSS/JavaScript for UI

References

[1]. https://link.springer.com/referenceworkentry/10.1007%2F978-3-319-28099-8_50-1
[2]. https://arxiv.org/abs/1106.1813
[3]. https://www.mentalhelp.net/psychological-testing/myers-briggs-type-indicator/
