
ML-Labs-webscraping

Approach:

IMPORTANT: We are not concerned with the correctness of the classification. We merely use it as an opportunity to demonstrate a use case for web scraping in the context of Machine Learning. First, we generate a labeled dataset by scraping the web for both 'fake' and 'not_fake' news stories.

  1. Identify several sites to scrape using a list of websites flagged as 'fake' from Benedictine University: https://researchguides.ben.edu/c.php?g=608230&p=4352564
  2. Scrape articles from the above websites. We will label these articles 'fake'.

  3. Scrape a reputable news website which does not lean left or right. We will label these articles 'not_fake'. (A minimal scraping sketch follows this list.)
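A minimal sketch of steps 2 and 3, assuming the `requests` and `beautifulsoup4` libraries. The URLs and the paragraph-tag extraction are placeholders; the real site lists come from the Benedictine guide and a chosen reputable outlet, and the selectors depend on each site's HTML.

```python
import requests
from bs4 import BeautifulSoup

def scrape_articles(urls, label):
    """Fetch each URL and return (text, label) pairs built from its <p> tags."""
    rows = []
    for url in urls:
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable or misbehaving sites
        soup = BeautifulSoup(html, "html.parser")
        text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
        if text:
            rows.append((text, label))
    return rows

# Hypothetical example URLs; replace with the scraped site lists.
fake_rows = scrape_articles(["https://example-fake-news-site.com/story1"], "fake")
real_rows = scrape_articles(["https://example-reputable-site.com/article1"], "not_fake")
dataset = fake_rows + real_rows
```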

This combined dataset will give us the ability to build a simple model for classifying articles as 'fake' or 'not_fake'. We can start with Multinomial Naive Bayes. This is a simple bag-of-words model; its success depends on the assumption that 'fake' stories use different words than 'not_fake' stories. While Naive Bayes is not particularly good at estimating probabilities, it has been shown to be quite effective for classification. In other words, even if a predicted probability of 0.9 is pretty far off from the true value, the prediction still lands on the same side of the 0.5 threshold and yields the same class (assuming we split on 0.5).
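As a sketch of that baseline, assuming scikit-learn is available and that `dataset` is the list of `(text, label)` pairs built above, a bag-of-words Multinomial Naive Bayes pipeline might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

texts, labels = zip(*dataset)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# CountVectorizer turns each article into word counts (the bag-of-words
# representation); MultinomialNB models those counts per class.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```

Swapping CountVectorizer for TfidfVectorizer, or the classifier for logistic regression, would be a natural next experiment, but the simple count-based pipeline is enough to test the word-usage assumption.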
