🍲 SoupMix 🍲

SoupMix is a platform that lets news organizations, or anyone else, scrape sites and collect data in a few simple steps. It is very much a work in progress.

Guiding principles:

  • updateable
  • accessible
  • extensible

The Basics

This application will use the Django framework. Installation will require a few details from an Amazon AWS instance, but otherwise users will be able to get it up and running quickly.
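
As a rough sketch of what those few details might look like, the instance-specific values could be read from environment variables in settings.py rather than hard-coded. The variable names below are hypothetical and not taken from the project, and this assumes Celery is configured to read its broker URL from Django settings.

    # settings.py (sketch) -- pull AWS-instance details from the environment
    # instead of hard-coding them. All variable names here are hypothetical.
    import os

    # Address of the Redis broker (e.g., the EC2 instance running redis-server)
    REDIS_HOST = os.environ.get("SOUPMIX_REDIS_HOST", "localhost")
    REDIS_PORT = os.environ.get("SOUPMIX_REDIS_PORT", "6379")

    # Celery points its broker at that instance
    CELERY_BROKER_URL = "redis://{}:{}/0".format(REDIS_HOST, REDIS_PORT)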

This is all aspirational.

After launching the platform, users will be able to build a new scraper in Python, or prebake one from internal elements or submitted "recipes." Along the way, the platform will demonstrate good practices (sleeping for a second between requests, etc.) and will include resources on the ethics of scraping. A rough sketch of what a recipe might look like follows the list below.

  • Present prebuilt blocks for common scraper tactics
  • Allow the launch of new scrapers
  • Log errors consistently in one accessible place
  • Allow email updates on scrapers
  • Data analysis in Django?
  • Will it be easier to install on a Raspberry Pi?
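
Here is a rough sketch of what such a recipe could look like, with the polite one-second sleep and errors logged in one place. The function, logger name, and approach are made up for illustration and do not reflect SoupMix's actual internals.

    # Illustrative recipe sketch only -- not part of SoupMix.
    import logging
    import time

    import requests
    from bs4 import BeautifulSoup

    logger = logging.getLogger("soupmix.recipes")  # hypothetical logger name

    def scrape_headlines(urls):
        """Fetch each URL, pull out <h2> text, and sleep between requests."""
        results = []
        for url in urls:
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                soup = BeautifulSoup(response.text, "html.parser")
                results.extend(h2.get_text(strip=True) for h2 in soup.find_all("h2"))
            except requests.RequestException as exc:
                # Errors land in one consistent, accessible place
                logger.error("Failed to scrape %s: %s", url, exc)
            time.sleep(1)  # good practice: wait a second between requests
        return results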

This project will be heavily influenced by the LA Times' Django for data analysis session at NICAR 2016.

How to start

  1. In a terminal window, enter
     redis-server
  2. In a separate terminal window, enter
     workon soupMix
     cd scrapersuite
     celery -A scrapersuite worker -l info
  3. In a separate terminal window, enter
     workon soupMix
     python manage.py runserver
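
The worker started in step 2 executes scraper tasks, but tasks only run on a schedule if a Celery beat process is also running alongside it. As a minimal, hypothetical sketch (the task name and timing below are not taken from the project), a scheduled scraper might be wired up like this:

    # Minimal, hypothetical sketch of scheduling a scraper with Celery beat.
    # The task name and schedule are illustrative only.
    from celery import Celery
    from celery.schedules import crontab

    app = Celery("scrapersuite", broker="redis://localhost:6379/0")

    @app.task(name="scrapersuite.run_example_scraper")
    def run_example_scraper():
        """Placeholder: a real recipe's scraping function would be called here."""
        print("scraping...")

    # Fire the task every morning at 6:00; a beat process must be running,
    # e.g. `celery -A scrapersuite beat -l info` in another terminal.
    app.conf.beat_schedule = {
        "run-example-scraper-daily": {
            "task": "scrapersuite.run_example_scraper",
            "schedule": crontab(hour=6, minute=0),
        },
    }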

Built using

Preproduction checklist

  • Remove all keys (see the settings sketch below)
  • Set DEBUG to False
  • See Celery's daemonization instructions for running Celery/Redis as background processes
  • Heavy documentation
  • Code cleanup
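
One common way to handle the first two items, sketched here as an assumption rather than the project's actual approach, is to read secrets and the debug flag from environment variables so nothing sensitive ships in the repository:

    # settings.py (sketch) -- hypothetical production hardening.
    import os

    SECRET_KEY = os.environ["DJANGO_SECRET_KEY"]  # never committed to the repo
    DEBUG = os.environ.get("DJANGO_DEBUG", "false").lower() == "true"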

Tutorials I'm using
