tianyaqu / economist-scrapy

Scrapy spider and PostgreSQL pipeline for The Economist


A diligent spider for EcoArchive.

nltk

Article summaries are generated with nltk, so you'll have to install the nltk package and its corpora data first: brown, averaged_perceptron_tagger, and punkt.
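The repo's summarizer relies on nltk (punkt for sentence splitting, plus the brown corpus and tagger). As a rough illustration of what frequency-based extractive summarization does, here is a minimal stdlib-only sketch; the naive regex tokenization below stands in for punkt and is not the repo's actual code:

```python
import re
from collections import Counter

def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences by word frequency.

    A toy sketch of extractive summarization; the real spider uses
    nltk's punkt tokenizer and tagged corpora instead of these regexes.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    # Score each sentence by the summed frequency of its words.
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```

The brown corpus and averaged_perceptron_tagger let a real summarizer weight words by part of speech rather than raw frequency.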

crawl and storage

PostgreSQL is used as storage, assuming you have a PostgreSQL service running on localhost at the default port. Set your own values (db, username, password) in setting.py.
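A Scrapy item pipeline along these lines could back the storage step. This is a minimal sketch, not the repo's actual pipeline: the psycopg2 driver, the `articles` table, its columns, and the `DB_NAME`/`DB_USER`/`DB_PASSWORD` setting names are all assumptions for illustration:

```python
class PostgresPipeline:
    """Sketch of a PostgreSQL pipeline; schema and setting names are hypothetical."""

    INSERT_SQL = (
        "INSERT INTO articles (title, url, summary) "
        "VALUES (%s, %s, %s) ON CONFLICT (url) DO NOTHING"
    )

    def __init__(self, db, user, password, host="localhost", port=5432):
        self.db, self.user, self.password = db, user, password
        self.host, self.port = host, port
        self.conn = self.cur = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read the credentials you configured in setting.py.
        s = crawler.settings
        return cls(s.get("DB_NAME"), s.get("DB_USER"), s.get("DB_PASSWORD"))

    def open_spider(self, spider):
        import psycopg2  # deferred so the module imports without the driver
        self.conn = psycopg2.connect(
            dbname=self.db, user=self.user, password=self.password,
            host=self.host, port=self.port,
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        self.cur.execute(self.INSERT_SQL,
                         (item["title"], item["url"], item["summary"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        if self.conn:
            self.cur.close()
            self.conn.close()
```

`ON CONFLICT (url) DO NOTHING` keeps weekly re-crawls from inserting duplicate articles, assuming a unique constraint on the url column.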

Use crontab to schedule the crawl job each week.
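For the weekly schedule, a crontab entry like the following would work; the project path, schedule, and log location here are placeholders, not taken from the repo:

```
# Hypothetical crontab entry: run the crawl every Monday at 06:00.
0 6 * * 1 cd /path/to/economist-scrapy && scrapy crawl eco >> /var/log/eco-crawl.log 2>&1
```

The `cd` matters because `scrapy crawl` must run from inside the project directory to find scrapy.cfg.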

install

  1. Install PostgreSQL first:

yum install postgresql-server postgresql-contrib

postgresql-setup initdb

  2. Install Python packages:

pip install -r requirement.txt

  3. Install the nltk corpora:

ipython

import nltk

nltk.download('punkt')

nltk.download('brown')

nltk.download('averaged_perceptron_tagger')

crawl

mv collector/setting.py.example collector/setting.py

Change the DB user and password, then run the crawl:

scrapy crawl eco

About


License:GNU General Public License v3.0

