A Reddit Flair Detector web application to detect flairs of India subreddit posts using Machine Learning algorithms. The application can be found live at Reddit Flair Detector.
The directory is a Django web application set up for hosting on Heroku servers. The files and folders are described below:
- manage.py - The file used to start the Django server.
- requirements.txt - Contains all Python dependencies of the project.
- nltk.txt - Contains all dependencies needed from the NLTK library.
- Procfile - Needed to set up Heroku.
- website - Folder containing the master settings of the Django application.
- templates - Folder containing HTML/CSS files.
- flair-detector - Folder containing the main application which loads the Machine Learning models and renders the results on the web application.
- data - Folder containing CSV and MongoDB instances of the collected data.
- Models - Folder containing the saved model.
- Jupyter Notebooks - Folder containing Jupyter Notebooks to collect Reddit India data and train Machine Learning models. Notebooks can be opened in Colaboratory by Google.
The entire code has been developed in the Python programming language, utilizing its powerful text processing and machine learning modules. The application has been developed using the Django web framework and is hosted on a Heroku web server.
- Open the `Terminal`.
- Clone the repository by entering `git clone https://github.com/radonys/Reddit-Flair-Detector.git`.
- Ensure that `Python3` and `pip` are installed on the system.
- Create a `virtualenv` by executing the following command: `virtualenv -p python3 env`.
- Activate the `env` virtual environment by executing the following command: `source env/bin/activate`.
- Enter the cloned repository directory and execute `pip install -r requirements.txt`.
- Enter the `python` shell and `import nltk`. Execute `nltk.download('stopwords')` and exit the shell.
- Now, execute the following command: `python manage.py runserver`. It will print the `localhost` address and the port on which the server is running.
- Open that `IP Address` in a web browser and use the application.
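Before running the steps above, it can help to confirm that the required tools are on the `PATH`. The following is a minimal sketch using only the Python standard library; the function name `check_prerequisites` is hypothetical and is not part of the repository.

```python
import shutil
import sys


def check_prerequisites():
    """Return a dict telling whether each basic tool appears to be available."""
    return {
        # The interpreter running this script must be Python 3.
        "python3": sys.version_info.major == 3,
        # shutil.which returns None when the executable is not on PATH.
        "pip": shutil.which("pip") is not None or shutil.which("pip3") is not None,
        "virtualenv": shutil.which("virtualenv") is not None,
    }


if __name__ == "__main__":
    for tool, found in check_prerequisites().items():
        print(f"{tool}: {'found' if found else 'missing'}")
```

Any tool reported as missing should be installed before creating the virtual environment.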
The dependencies can be found in requirements.txt.
After reviewing the literature on text processing and on machine learning algorithms suitable for text classification, I based my approach on [2], which describes various machine learning models such as Naive-Bayes, Linear SVM and Logistic Regression for text classification, with code snippets. In addition, I tried other models, namely Random Forest and Multi-Layer Perceptron. The test accuracies obtained in the various scenarios can be found in the next section.
The approach taken for the task is as follows:
- Collect 100 posts from the India subreddit for each of the 12 flairs using the `praw` module [1].
- The data includes title, comments, body, url, author, score, id, time-created and number of comments.
- For comments, only top-level comments are included in the dataset; no sub-comments are present.
- The title, comments and body are cleaned by removing bad symbols and stopwords using `nltk`.
- Five types of features are considered for the given task:
a) Title
b) Comments
c) Urls
d) Body
e) Combining Title, Comments and Urls as one feature.
- The dataset is split into 70% train and 30% test data using `train_test_split` of `scikit-learn`.
- The dataset is then converted into a `Vector` and `TF-IDF` form.
- Then, the following ML algorithms (using `scikit-learn` libraries) are applied on the dataset:
a) Naive-Bayes
b) Linear Support Vector Machine
c) Logistic Regression
d) Random Forest
e) MLP
- Training and testing on the dataset showed that Random Forest achieved the best test accuracy, 77.97%, when trained on the combined Title + Comments + Url feature.
- The best model is saved and is used for prediction of the flair from the URL of the post.
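The cleaning, feature-combination and TF-IDF steps above can be sketched in plain Python as follows. This is a minimal, dependency-free illustration, not the project's code: the stop-word set is a tiny hypothetical stand-in for NLTK's English list, the regular expressions and the way the combined feature is joined are assumptions, and the hand-rolled `tfidf` uses a textbook weighting that differs from scikit-learn's defaults (which also smooth and normalise).

```python
import math
import re
from collections import Counter

# Hypothetical stand-in; the project removes NLTK's English stopwords instead.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "and", "to", "for", "on"}

# Assumed patterns for "bad symbols": some punctuation becomes a space,
# everything else that is not a digit, lowercase letter, space, #, + or _ is dropped.
REPLACE_BY_SPACE = re.compile(r"[/(){}\[\]|@,;]")
BAD_SYMBOLS = re.compile(r"[^0-9a-z #+_]")


def clean_text(text):
    """Lowercase the text, strip unwanted symbols, and drop stop words."""
    text = text.lower()
    text = REPLACE_BY_SPACE.sub(" ", text)
    text = BAD_SYMBOLS.sub("", text)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def combine_features(title, comments, url):
    """Join cleaned title, cleaned comments and the raw URL into one string."""
    return " ".join([clean_text(title), clean_text(comments), url])


def tfidf(documents):
    """Return one {term: weight} dict per document, weight = tf * (ln(N / df) + 1)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {term: count * (math.log(n_docs / doc_freq[term]) + 1.0)
             for term, count in counts.items()}
        )
    return vectors


if __name__ == "__main__":
    combined = combine_features(
        "Budget 2020!", "Good for the economy", "https://example.com/budget"
    )
    print(combined)
    print(tfidf(["budget india", "cricket india"]))
```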
Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.6011904762 |
Linear SVM | 0.6220238095 |
Logistic Regression | 0.6339285714 |
Random Forest | 0.6160714286 |
MLP | 0.4970238095 |

Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.2083333333 |
Linear SVM | 0.2470238095 |
Logistic Regression | 0.2619047619 |
Random Forest | 0.2767857143 |
MLP | 0.2113095238 |

Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.3005952381 |
Linear SVM | 0.3898809524 |
Logistic Regression | 0.3690476190 |
Random Forest | 0.3005952381 |
MLP | 0.3214285714 |

Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.5357142857 |
Linear SVM | 0.6190476190 |
Logistic Regression | 0.6220238095 |
Random Forest | 0.6011904762 |
MLP | 0.4761904762 |

Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.6190476190 |
Linear SVM | 0.7529761905 |
Logistic Regression | 0.7470238095 |
Random Forest | 0.7797619048 |
MLP | 0.4940476190 |
Used on their own, the features reached test accuracies of around 60% at best, with the body feature giving the worst accuracies during the experiments. Hence, it was excluded from the combined feature set.
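A comparison like the one in the tables above could be reproduced along the following lines. This is a hedged sketch using scikit-learn, not the project's training code: the toy texts, the two flair names, `LinearSVC` as the linear SVM, the `compare_models` helper and all hyperparameters are assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data; the real dataset holds cleaned Reddit posts and 12 flairs.
politics = [
    "election results announced today",
    "parliament passes the new bill",
    "minister addresses the assembly",
    "opposition questions the government budget",
    "state election campaign begins",
]
sports = [
    "cricket team wins the series",
    "football league final tonight",
    "hockey squad announced for the tournament",
    "batsman scores a century in the match",
    "olympic medal for the wrestler",
]
texts = (politics + sports) * 2
labels = (["Politics"] * 5 + ["Sports"] * 5) * 2


def compare_models(texts, labels, test_size=0.3, seed=42):
    """Train each classifier on count + TF-IDF features and return test accuracies."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    models = {
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed),
    }
    scores = {}
    for name, classifier in models.items():
        pipeline = Pipeline([
            ("vect", CountVectorizer()),    # raw text -> token counts
            ("tfidf", TfidfTransformer()),  # counts -> TF-IDF weights
            ("clf", classifier),
        ])
        pipeline.fit(x_train, y_train)
        scores[name] = accuracy_score(y_test, pipeline.predict(x_test))
    return scores


if __name__ == "__main__":
    for name, accuracy in compare_models(texts, labels).items():
        print(f"{name}: {accuracy:.4f}")
```

The accuracies printed for this toy data say nothing about the real dataset; the sketch only shows the shape of the experiment.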