notnews / top10

Top 10 News! Scraping and Parsing Home pages and Top 10 Lists on News Sites


We scraped and parsed the homepages, politics pages, and top10 lists of prominent news sites for 2012 and 2016--2017. All of the scraping was done in 2016--2017, so the 2012 data comes exclusively from the Internet Archive. The 2016--2017 data comes mostly from scraping the live sites; for sites we realized too late that we wanted to scrape, it also comes from the Internet Archive.
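As a rough illustration of how an archived snapshot for a given date can be located, the sketch below uses the Wayback Machine availability endpoint. It is not the repository's own code: the function names and the example site and date are assumptions made for illustration, and the actual scripts may locate snapshots differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine availability endpoint.
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def availability_query(site_url, timestamp):
    """Build the query URL for the snapshot closest to `timestamp` (YYYYMMDD)."""
    return WAYBACK_AVAILABILITY + "?" + urlencode(
        {"url": site_url, "timestamp": timestamp}
    )


def closest_snapshot(response):
    """Return (snapshot_url, snapshot_timestamp) from a decoded response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def fetch_closest(site_url, timestamp):
    """Hypothetical helper: query the endpoint over the network and decode the result."""
    with urlopen(availability_query(site_url, timestamp)) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `fetch_closest("example.com", "20120601")` would return the URL and timestamp of the nearest archived copy, or `None` if the site was never archived. The returned timestamp can differ from the requested one, which is worth recording alongside the downloaded HTML.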

Data

For a summary of the data, see here. The raw data (HTML files) and the CSVs with the parsed data are posted here.

Scripts

The scripts for scraping the Internet Archive still run nicely. The scripts for scraping current homepages, politics pages, and top10 lists have largely survived, but the sites have changed their formats since then, so the scripts may no longer work.
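To show why format changes break this kind of parser, here is a minimal sketch of a top10-list extractor built on the standard library. The markup it expects (an `<ol>` with a `most-popular` class containing links) and every name in it are hypothetical; each real site needs its own selectors, and the repository's parsers are not reproduced here.

```python
from html.parser import HTMLParser


class TopListParser(HTMLParser):
    """Collect (rank, title, href) for links inside <ol class="...list_class...">."""

    def __init__(self, list_class):
        super().__init__()
        self.list_class = list_class  # hypothetical class name of the target list
        self.depth = 0                # > 0 while inside the target <ol>
        self.href = None              # href of the link currently being read
        self.text = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ol":
            if self.depth:
                self.depth += 1       # nested list inside the target list
            elif self.list_class in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "a" and self.depth and "href" in attrs:
            self.href = attrs["href"]
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "ol" and self.depth:
            self.depth -= 1
        elif tag == "a" and self.href is not None:
            title = " ".join("".join(self.text).split())  # collapse whitespace
            self.items.append((len(self.items) + 1, title, self.href))
            self.href = None


def parse_top_list(html, list_class="most-popular"):
    """Return the ranked links found in the list, or [] if the markup does not match."""
    parser = TopListParser(list_class)
    parser.feed(html)
    parser.close()
    return parser.items
```

If a site renames the class or restructures the list, `parse_top_list` silently returns an empty list, so checking for empty results on each run is a simple way to notice that a parser has gone stale.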

To learn how we set up live scraping and parsing of the data, including monitoring, see here.

License

Released under the MIT License

Authors

Suriyan Laohaprapanon and Gaurav Sood


Languages

Python 99.7%, Shell 0.3%