🍲 SoupMix 🍲

SoupMix is a platform that lets news organizations, or anyone else, scrape sites and collect data in a few simple steps. It is very much a work in progress.

Guiding principles:

  • updateable
  • accessible
  • extensible

The Basics

This application will use the Django framework. Installation will require a few details from an Amazon AWS instance, but otherwise users will be able to get it up and running quickly.
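
As a rough sketch of what those few details might look like, the instance-specific values could be read from environment variables in settings.py rather than hard-coded. The variable names below are hypothetical and not taken from the project, and this assumes Celery is configured to read its broker URL from Django settings.

    # settings.py (sketch) -- pull AWS-instance details from the environment
    # instead of hard-coding them. All variable names here are hypothetical.
    import os

    # Address of the Redis broker (e.g., the EC2 instance running redis-server)
    REDIS_HOST = os.environ.get("SOUPMIX_REDIS_HOST", "localhost")
    REDIS_PORT = os.environ.get("SOUPMIX_REDIS_PORT", "6379")

    # Celery points its broker at that instance
    CELERY_BROKER_URL = "redis://{}:{}/0".format(REDIS_HOST, REDIS_PORT)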

This is all aspirational.

After launching the platform, users will be able to build a new scraper in Python, or prebake one from internal elements or submitted "recipes." Along the way, the platform will demonstrate good practices (sleeping for a second between requests, etc.) and will include resources on the ethics of scraping. A rough sketch of what a recipe might look like follows the list below.

  • Present prebuilt blocks for common scraper tactics
  • Allow the launch of new scrapers
  • Log errors consistently in one accessible place
  • Allow email updates on scrapers
  • Data analysis in Django?
  • Will it be easier to install on a Raspberry Pi?
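
Here is a rough sketch of what such a recipe could look like, with the polite one-second sleep and errors logged in one place. The function, logger name, and approach are made up for illustration and do not reflect SoupMix's actual internals.

    # Illustrative recipe sketch only -- not part of SoupMix.
    import logging
    import time

    import requests
    from bs4 import BeautifulSoup

    logger = logging.getLogger("soupmix.recipes")  # hypothetical logger name

    def scrape_headlines(urls):
        """Fetch each URL, pull out <h2> text, and sleep between requests."""
        results = []
        for url in urls:
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                soup = BeautifulSoup(response.text, "html.parser")
                results.extend(h2.get_text(strip=True) for h2 in soup.find_all("h2"))
            except requests.RequestException as exc:
                # Errors land in one consistent, accessible place
                logger.error("Failed to scrape %s: %s", url, exc)
            time.sleep(1)  # good practice: wait a second between requests
        return results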

This project will be heavily influenced by the LA Times' Django for data analysis session at NICAR 2016.

How to start

  1. In a terminal window, enter
     redis-server
  2. In a separate terminal window, enter
     workon soupMix
     cd scrapersuite
     celery -A scrapersuite worker -l info
  3. In a separate terminal window, enter
     workon soupMix
     python manage.py runserver
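
The worker started in step 2 executes scraper tasks, but tasks only run on a schedule if a Celery beat process is also running alongside it. As a minimal, hypothetical sketch (the task name and timing below are not taken from the project), a scheduled scraper might be wired up like this:

    # Minimal, hypothetical sketch of scheduling a scraper with Celery beat.
    # The task name and schedule are illustrative only.
    from celery import Celery
    from celery.schedules import crontab

    app = Celery("scrapersuite", broker="redis://localhost:6379/0")

    @app.task(name="scrapersuite.run_example_scraper")
    def run_example_scraper():
        """Placeholder: a real recipe's scraping function would be called here."""
        print("scraping...")

    # Fire the task every morning at 6:00; a beat process must be running,
    # e.g. `celery -A scrapersuite beat -l info` in another terminal.
    app.conf.beat_schedule = {
        "run-example-scraper-daily": {
            "task": "scrapersuite.run_example_scraper",
            "schedule": crontab(hour=6, minute=0),
        },
    }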

Built using

Preproduction checklist

  • Remove all keys (see the settings sketch below)
  • Set DEBUG to False
  • See Celery's daemonization instructions for running Celery/Redis as background processes
  • Heavy documentation
  • Code cleanup
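
One common way to handle the first two items, sketched here as an assumption rather than the project's actual approach, is to read secrets and the debug flag from environment variables so nothing sensitive ships in the repository:

    # settings.py (sketch) -- hypothetical production hardening.
    import os

    SECRET_KEY = os.environ["DJANGO_SECRET_KEY"]  # never committed to the repo
    DEBUG = os.environ.get("DJANGO_DEBUG", "false").lower() == "true"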

Tutorials I'm using
