jpbonson / WebScrapingTool

Web Scraper + API, using Scrapy and Python3/Django.

Web Scraping Tool

Web Scraping Tool consists of two applications: an extensible web scraper (Scrapy) and an API that serves the scraped data.

Currently only a spider for TechCrunch is available, but the project can be extended with more spiders.

Python. Django. PostgreSQL. Scrapy.

Note: The API (webscrapingtool) runs on Python 3, since it is the recommended version and the one compatible with Heroku. The scraper (webscraper), however, only works on Python 2 because it depends on the 'twisted' package, which had not been migrated to Python 3 at the time of writing.

How to install?

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev python3 python3-dev libpq-dev postgresql postgresql-contrib nginx
pipenv install
python manage.py makemigrations
python manage.py migrate
export DJANGO_SETTINGS_MODULE=webscrapingtool.settings

How to run?

For API (Python 3):

gunicorn webscrapingtool.wsgi

For scraper (Python 2, uses the Pipfile inside webscraper):

  • local
sh run_scraper_local.sh
  • heroku
sh run_scraper_heroku.sh

How to test?

cd webscrapingtool; python manage.py test; cd ..

API Routes

Heroku: https://powerful-fjord-44213.herokuapp.com/

Outlets
Authors
Articles
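
As a rough illustration of how a client might call these routes, the sketch below builds resource URLs against the Heroku base URL and fetches JSON using only the standard library. The paths (outlets/, authors/, articles/) and the helper names build_url and fetch_json are assumptions made for this example: the README names the resources but not their exact paths, so check the deployed API before relying on them.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "https://powerful-fjord-44213.herokuapp.com/"

# Assumed route names: the README lists the resources, not their exact paths.
RESOURCES = ("outlets", "authors", "articles")


def build_url(resource, base_url=BASE_URL):
    """Return the (assumed) list URL for a resource, e.g. <base>/articles/."""
    if resource not in RESOURCES:
        raise ValueError("unknown resource: %r" % resource)
    return urljoin(base_url, resource + "/")


def fetch_json(url, timeout=10):
    """GET a URL and decode its JSON body. This performs a network request."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_json(build_url("articles")) would then return the decoded response, provided the app is running and the assumed path matches a real route.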

TODOs:

  • improve the search articles feature to support more complex queries
  • allow sorting of results
  • generate good documentation, maybe using Swagger
  • reference models by hyperlinks instead of PKs
  • reorganize tests to use factories, to avoid duplicated code
  • allow scraper to do POSTs in batches, to improve write performance
  • add more tests for the 'sad' paths
  • change Article's 'tags' to store an array of strings; the scraper should then collect an array of 'categories'
  • maybe: routes for authors/:authorId/articles
  • maybe: pagination
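
The batching TODO above could be approached along these lines. This is only a sketch, not the project's implementation: the helper names chunked and build_batch_payloads, the default batch size, and the article fields are all hypothetical, and it assumes the API would expose an endpoint that accepts a JSON list of articles, which does not exist yet.

```python
import json


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_payloads(articles, batch_size=50):
    """Serialize scraped articles into one JSON request body per batch.

    Hypothetical helper: each returned string would be sent in a single
    POST to a bulk endpoint instead of issuing one POST per article.
    """
    return [json.dumps(batch) for batch in chunked(articles, batch_size)]
```

With a batch size of 50, 120 scraped articles would produce three request bodies instead of 120 separate POSTs.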

License: MIT License

