Tool to scrape the incredibly creepy mrkoll.se Swedish toplist and generate an RSS feed from it. Each RSS entry represents the list as it looked at a given point in time, plus convenience links to search for each person's name on DuckDuckGo, Flashback, and The Facebook.
pip install mrkoll-scraper
mrks --scrape
Scrapes the current list and saves it to the Shelve database mrks.db in the current working directory. This operation saves the raw scraped HTML as well as a generated feedgen FeedEntry. The data is saved in a dictionary with the current date as key, so multiple scrapes during the same day don't create new entries; they only update the existing one.
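A minimal sketch of the date-keyed storage idea, using only the standard library. The function name save_scrape, the record layout, and the demo file name are assumptions made for illustration; they are not taken from the actual mrks code, which also stores a FeedEntry next to the HTML.

```python
import datetime
import os
import shelve
import tempfile


def save_scrape(db_path, raw_html, when=None):
    """Store raw_html under the ISO date of `when` (defaults to today)."""
    key = (when or datetime.date.today()).isoformat()
    with shelve.open(db_path) as db:
        # Hypothetical record layout: only the raw HTML is kept here.
        # A second scrape on the same day overwrites this key.
        db[key] = {"html": raw_html}
    return key


# Demo in a temporary directory so no real database is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    monday = datetime.date(2020, 1, 6)
    save_scrape(path, "<ol><li>first</li></ol>", monday)
    save_scrape(path, "<ol><li>second</li></ol>", monday)  # same day: update
    save_scrape(path, "<ol><li>third</li></ol>", datetime.date(2020, 1, 13))
    with shelve.open(path) as db:
        stored = {key: record["html"] for key, record in db.items()}
```

After the demo, stored holds two keys: the second scrape on 2020-01-06 replaced the first instead of adding a new entry.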
mrks --regenerate
Iterates through the saved lists and regenerates the FeedEntry objects from the raw HTML, in case you've changed the HTML template or similar and want the changes applied retroactively.
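Roughly, regeneration means walking over every stored record and rebuilding the derived entry from the raw HTML that was kept alongside it. The sketch below shows that loop with a plain dict standing in for the Shelve database and a hypothetical render_entry function standing in for the real FeedEntry construction; neither the record layout nor the parsing reflects the actual mrks code.

```python
import re


def render_entry(date_key, raw_html):
    """Hypothetical stand-in for building a FeedEntry from raw HTML."""
    names = re.findall(r"<li>(.*?)</li>", raw_html)
    return {"title": f"Toplist {date_key}", "content": ", ".join(names)}


def regenerate(db):
    """Rebuild every derived entry in place from its stored raw HTML."""
    for key in list(db.keys()):
        record = db[key]
        record["entry"] = render_entry(key, record["html"])
        # Reassign so that a shelve opened without writeback persists it.
        db[key] = record


# A dict stands in for the database; the names are placeholders.
db = {"2020-01-06": {"html": "<li>Person A</li><li>Person B</li>", "entry": None}}
regenerate(db)
```

Keeping the raw HTML is what makes this possible: a change to the template only requires re-running the loop, not re-scraping lists that no longer exist online.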
mrks.wsgi contains a beautifully simple WSGI application that outputs an RSS feed based on the data currently saved.
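For readers unfamiliar with WSGI, here is a minimal sketch of what such an application can look like. It is not the contents of mrks.wsgi: the real application uses feedgen and reads from the Shelve database, while this version builds the XML by hand from an in-memory dict so that it runs with the standard library alone. The make_app name, the channel title and description, and the link URL are all assumptions.

```python
import xml.etree.ElementTree as ET
from wsgiref.util import setup_testing_defaults
from xml.sax.saxutils import escape


def make_app(entries):
    """Return a WSGI app serving `entries` ({date: title}) as RSS 2.0."""
    def app(environ, start_response):
        # Newest snapshot first; the date doubles as a non-permalink guid.
        items = "".join(
            "<item><title>" + escape(title) + "</title>"
            '<guid isPermaLink="false">' + escape(date) + "</guid></item>"
            for date, title in sorted(entries.items(), reverse=True)
        )
        body = (
            '<?xml version="1.0" encoding="utf-8"?>'
            '<rss version="2.0"><channel>'
            "<title>mrkoll toplist</title>"          # assumed channel title
            "<link>https://mrkoll.se/</link>"        # assumed link
            "<description>Toplist snapshots</description>"
            + items + "</channel></rss>"
        ).encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "application/rss+xml; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app


# Call the app directly with a fake request instead of starting a server.
environ = {}
setup_testing_defaults(environ)
captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


application = make_app({"2020-01-06": "Toplist 2020-01-06"})
body = b"".join(application(environ, start_response))
root = ET.fromstring(body)  # raises if the feed is not well-formed XML
```

WSGI servers such as mod_wsgi conventionally look for a module-level callable named application, which is why the demo uses that name.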
Crontab to run --scrape every Monday at midnight:
0 0 * * 1 cd /home/robert/mrkoll-scraper && /home/robert/mrkoll-scraper/venv/bin/mrks -s