SoupMix is a platform that lets news organizations, or anybody else, scrape sites for data in a few simple steps. It is very much a work in progress.
Guiding principles:
- updateable
- accessible
- extensible
This application uses the Django framework. Installation will require a few details from an Amazon Web Services (AWS) instance, but should otherwise let users get started very quickly.
This is all aspirational.
After launching the platform, users will be able to build a new scraper in Python, or prebake one from internal elements or submitted "recipes." Along the way, the platform will demonstrate good practices (sleeping for 1 second between requests, etc.) and will include resources on the ethics of scraping.
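As an illustration of the "sleep for 1 second" practice, here is a minimal sketch using only the Python standard library. The class name `PoliteFetcher`, the user-agent string, and the injectable `sleep` and `clock` parameters are hypothetical and not part of SoupMix; they only show one way a scraper could pace its requests.

```python
import time
import urllib.request


class PoliteFetcher:
    """Fetch pages while keeping at least `delay` seconds between requests.

    Hypothetical helper for illustration; not part of SoupMix itself.
    """

    def __init__(self, delay=1.0, user_agent="SoupMix-sketch/0.1",
                 sleep=time.sleep, clock=time.monotonic):
        self.delay = delay
        self.user_agent = user_agent
        # sleep and clock are injectable so the pacing can be tested
        # without actually waiting.
        self._sleep = sleep
        self._clock = clock
        self._last_request = None

    def wait(self):
        """Sleep just long enough to respect the delay, then record the time."""
        if self._last_request is not None:
            remaining = self.delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def fetch(self, url, timeout=10):
        """Wait if needed, then return the raw bytes of the page at `url`."""
        self.wait()
        request = urllib.request.Request(
            url, headers={"User-Agent": self.user_agent}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
```

A scraper would create one `PoliteFetcher()` per site and call `fetch(url)` for each page, so the pause is applied automatically instead of being scattered through the scraping code.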
- Will present prebuilt blocks for common scraper tactics
- Will allow the launch of new scrapers
- Will log errors consistently in one accessible place
- Will allow emailing updates on scrapers
- Data analysis in Django?
- Will it be easier to install on a Raspberry Pi?
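The goal of logging errors in one accessible place could look roughly like the sketch below, which uses Python's standard `logging` module. The function name `get_scraper_logger`, the `soupmix` logger namespace, and the `scrapers.log` file name are assumptions made for the example, not existing SoupMix code.

```python
import logging


def get_scraper_logger(scraper_name, log_path="scrapers.log"):
    """Return a logger for one scraper; all scrapers share one log file.

    Hypothetical helper: each scraper gets a child of the "soupmix" logger,
    and the single file handler lives on that shared parent.
    """
    parent = logging.getLogger("soupmix")
    if not parent.handlers:
        # Attach the file handler once, however many scrapers ask for a logger.
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        parent.addHandler(handler)

    logger = logging.getLogger("soupmix." + scraper_name)
    logger.setLevel(logging.INFO)
    return logger
```

Inside a scraper, calling `logger.exception("scrape failed")` in an `except` block writes the message and the traceback to the shared file, tagged with the scraper's name.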
Will be heavily influenced by the LA Times' Django for data analysis session at NICAR 2016.
- In a terminal window, enter
redis-server
- In a separate terminal window, enter
workon soupMix
cd scrapersuite
celery -A scrapersuite worker -l info
- In a separate terminal window, enter
workon soupMix
python manage.py runserver
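Because the Celery worker needs Redis to be running first, a quick check like the following can help confirm the order of the steps above. This is only a sketch: `port_is_open` is a hypothetical helper, and it assumes Redis and the Django development server are listening on their default ports (6379 and 8000) on the local machine.

```python
import socket


def port_is_open(host="127.0.0.1", port=6379, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed defaults: Redis on 6379, Django's runserver on 8000.
    print("redis-server reachable:", port_is_open(port=6379))
    print("Django dev server reachable:", port_is_open(port=8000))
```

The check only confirms that a process is listening on the port; it does not verify that the process is actually Redis or Django.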
- Remove all keys
- Set debug to false
- See the Daemonization instructions for running Celery/Redis in the background
- Heavy documentation
- Code cleanup
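For the first two items, one common approach is to read secrets and the debug flag from environment variables rather than hard-coding them. The sketch below assumes "keys" means secrets such as Django's `SECRET_KEY`; the helper names `env_bool` and `require_env` and the `SOUPMIX_*` variable names are hypothetical.

```python
import os


def env_bool(name, default=False, environ=os.environ):
    """Read a true/false flag from the environment."""
    value = environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def require_env(name, environ=os.environ):
    """Return a required setting, failing loudly if it is missing."""
    try:
        return environ[name]
    except KeyError:
        raise RuntimeError("Set the %s environment variable" % name) from None


# In settings.py these would replace hard-coded values, for example
# (variable names are assumptions, not existing SoupMix settings):
# SECRET_KEY = require_env("SOUPMIX_SECRET_KEY")
# DEBUG = env_bool("SOUPMIX_DEBUG", default=False)
```

Defaulting `DEBUG` to false means a deployment that forgets to set the variable fails safe instead of exposing debug pages.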
- Info from previous legtracker project
- Deploying Django on AWS
- Django tutorial
- Tutorial to add syntax highlighting
- Outputting CSVs with Django
- Automate the Boring Stuff with Python
- Integrating tasks with Django and Celery (but needed to add the __init__.py file)
- Async Celery by Example and How to work with ajax and django