tianyaqu / economist-scrapy

Scrapy spider and PostgreSQL pipeline for The Economist


A diligent spider for EcoArchive.

nltk

Article summaries are generated with nltk, so you'll have to install the nltk package and its corpora data first: brown, averaged_perceptron_tagger, and punkt.
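The repo's summarizer relies on nltk (punkt for sentence splitting, plus the brown corpus and tagger). As a rough illustration of what frequency-based extractive summarization does, here is a minimal stdlib-only sketch; the naive regex tokenization below stands in for punkt and is not the repo's actual code:

```python
import re
from collections import Counter

def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences by word frequency.

    A toy sketch of extractive summarization; the real spider uses
    nltk's punkt tokenizer and tagged corpora instead of these regexes.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    # Score each sentence by the summed frequency of its words.
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```

The brown corpus and averaged_perceptron_tagger let a real summarizer weight words by part of speech rather than raw frequency.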

crawl and storage

PostgreSQL is used as storage, assuming you have a PostgreSQL service running on localhost at the default port. Set your own values (db, username, password) in setting.py.
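A Scrapy item pipeline along these lines could back the storage step. This is a minimal sketch, not the repo's actual pipeline: the psycopg2 driver, the `articles` table, its columns, and the `DB_NAME`/`DB_USER`/`DB_PASSWORD` setting names are all assumptions for illustration:

```python
class PostgresPipeline:
    """Sketch of a PostgreSQL pipeline; schema and setting names are hypothetical."""

    INSERT_SQL = (
        "INSERT INTO articles (title, url, summary) "
        "VALUES (%s, %s, %s) ON CONFLICT (url) DO NOTHING"
    )

    def __init__(self, db, user, password, host="localhost", port=5432):
        self.db, self.user, self.password = db, user, password
        self.host, self.port = host, port
        self.conn = self.cur = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read the credentials you configured in setting.py.
        s = crawler.settings
        return cls(s.get("DB_NAME"), s.get("DB_USER"), s.get("DB_PASSWORD"))

    def open_spider(self, spider):
        import psycopg2  # deferred so the module imports without the driver
        self.conn = psycopg2.connect(
            dbname=self.db, user=self.user, password=self.password,
            host=self.host, port=self.port,
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        self.cur.execute(self.INSERT_SQL,
                         (item["title"], item["url"], item["summary"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        if self.conn:
            self.cur.close()
            self.conn.close()
```

`ON CONFLICT (url) DO NOTHING` keeps weekly re-crawls from inserting duplicate articles, assuming a unique constraint on the url column.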

Use crontab to schedule the crawl job each week.
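For the weekly schedule, a crontab entry like the following would work; the project path, schedule, and log location here are placeholders, not taken from the repo:

```
# Hypothetical crontab entry: run the crawl every Monday at 06:00.
0 6 * * 1 cd /path/to/economist-scrapy && scrapy crawl eco >> /var/log/eco-crawl.log 2>&1
```

The `cd` matters because `scrapy crawl` must run from inside the project directory to find scrapy.cfg.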

install

  1. Install PostgreSQL first:

yum install postgresql-server postgresql-contrib

postgresql-setup initdb

  2. Install Python packages:

pip install -r requirement.txt

  3. Install the nltk corpora:

ipython

import nltk

nltk.download('punkt')

nltk.download('brown')

nltk.download('averaged_perceptron_tagger')

crawl

mv collector/setting.py.example collector/setting.py

Change the DB user and password, then run the crawl:

scrapy crawl eco

About


License:GNU General Public License v3.0

