reddit reddit-post thread reddit-user fivethirtyeight natural-language-processing problem-statement

Web Scraping Reddit: Natural Language Processing

Photo from http://www.quertime.com/article/15-reddit-user-and-data-analytic-tools/

Scenario

You're fresh out of your Data Science bootcamp and looking to break through in the world of freelance data journalism. Nate Silver and co. at FiveThirtyEight have agreed to hear your pitch for a story in two weeks!

Your piece is going to be on how to create a Reddit post that will get the most engagement from Reddit users. Because this is FiveThirtyEight, you're going to have to get data and analyze it in order to make a compelling narrative.

Project Summary

In this project, I practiced two major skills: collecting data by scraping a website, and building a binary predictor from that data.

There are two components to starting a data science problem: defining the problem statement and acquiring the data.

For this article, the problem statement will be: What characteristics of a post on Reddit are most predictive of the overall interaction on a thread (as measured by number of comments)?

I will acquire the data by scraping the 'hot' threads listed on the Reddit homepage, collecting the following features for each thread:

The title of the thread, the subreddit that the thread belongs to, the length of time it has been up on Reddit, and the number of comments on the thread. Once the data is acquired, I will build a classification model that uses Natural Language Processing and any other relevant features to predict whether a given Reddit post will have above or below the median number of comments.
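As a rough illustration of the data preparation described above, the sketch below pulls the four features out of a listing and derives the binary target. It is not the notebook's actual code: it assumes the scraped data is shaped like Reddit's public JSON listing (with `title`, `subreddit`, `created_utc`, and `num_comments` fields), whereas the project may well have parsed HTML instead. The names `SAMPLE_LISTING`, `extract_rows`, and `add_median_label` are hypothetical, and the sample values are made up.

```python
import statistics
import time

# Made-up sample, shaped like Reddit's JSON listing format (an assumption;
# the original project may have scraped HTML rather than JSON).
SAMPLE_LISTING = {
    "data": {
        "children": [
            {"data": {"title": "Cat learns to open doors", "subreddit": "aww",
                      "created_utc": 1_000_000_000.0, "num_comments": 10}},
            {"data": {"title": "What is your hot take?", "subreddit": "AskReddit",
                      "created_utc": 1_000_003_600.0, "num_comments": 250}},
            {"data": {"title": "New telescope image released", "subreddit": "space",
                      "created_utc": 1_000_005_400.0, "num_comments": 40}},
            {"data": {"title": "Election results megathread", "subreddit": "politics",
                      "created_utc": 1_000_001_800.0, "num_comments": 900}},
        ]
    }
}


def extract_rows(listing, now=None):
    """Collect title, subreddit, time on Reddit (hours), and comment count."""
    now = time.time() if now is None else now
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        rows.append({
            "title": post["title"],
            "subreddit": post["subreddit"],
            "hours_up": (now - post["created_utc"]) / 3600,
            "num_comments": post["num_comments"],
        })
    return rows


def add_median_label(rows):
    """Mark each row 1 if its comment count is above the median, else 0."""
    median = statistics.median(r["num_comments"] for r in rows)
    for r in rows:
        r["above_median"] = int(r["num_comments"] > median)
    return median


# Fix "now" so the example is reproducible.
rows = extract_rows(SAMPLE_LISTING, now=1_000_007_200.0)
median = add_median_label(rows)
```

The `above_median` column is the binary target; the titles would then be turned into text features (for example, word counts) and combined with the subreddit and `hours_up` before fitting a classifier of your choice. Posts exactly at the median are labeled 0 here, which is one possible convention rather than a requirement.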

About

Using web-scraping techniques and Natural Language Processing in order to determine how Reddit submissions are ranked.

