kingyiusuen / reddit-post-classification

Classify whether a post comes from r/MachineLearning or r/LearnMachineLearning.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Reddit Post Classification

Code style: black pre-commit License

It can be tricky to find the right subreddit to submit your post. For example, it is not uncommon to find people posted in r/MachineLearning when they should have posted in r/LearningMachineLearning (which is for beginner's questions).

It will be useful if there is a tool that

  • helps new users post to the right subreddit, and
  • allows subreddit moderators to easily identify posts that might not belong to their particular subreddit.

In this project, I used Python Reddit API Wrapper (PRAW) to scrape a total of around 1,500 posts from r/MachineLearning and r/LearningMachineLearning, and trained a classifer (TF-IDF + logistic regression) to predict the subreddit of the posts. The model has an AUC score of 0.7 in the test set. Finally, I built a web app that suggests which subreddit you should post to with React.js. The backend API was built with Flask and was containerized with Docker.

Screenshot

Quick Start

Clone the repositiory, create a virtual envrionment, and install dependencies.

git clone https://github.com/kingyiusuen/reddit-post-classification.git
make venv

Start the backend server.

python backend/app.py

Start the frontend server.

cd frontend
npm start

To start scraping, follow the structure of secrets.example.json and create a secrets.json file at the project root directory. Fill out the necessary information.

Docker

Build Docker image for backend API.

docker build --tag reddit-post-classifier --file backend/Dockerfile --platform linux/amd64 .

Run the Docker image as a container.

docker run -p 8080:8080 -it --rm reddit-post-classifier

Test the container with a POST request.

curl -XPOST "http://0.0.0.0:8080/predict" -H 'Content-Type: application/json' -d '{"text": "I love machine learning"}'

About

Classify whether a post comes from r/MachineLearning or r/LearnMachineLearning.

License:MIT License


Languages

Language:Jupyter Notebook 99.3%Language:JavaScript 0.2%Language:Python 0.1%Language:CSS 0.1%Language:Makefile 0.1%Language:HTML 0.1%Language:Dockerfile 0.0%Language:Shell 0.0%