PttScrapyMongoDB
Environment
- macOS Sierra
Usage
Preparation
- Install the MongoDB server.
- Install the Python dependencies:
pip install -r requirements.txt
Command line
- Change to the project folder.
- Run the following command in a shell:
scrapy crawl ptt -a board=EZSoft -a pages=2
- board: the name of the board to crawl.
- pages: the number of pages to crawl.
- title_lim
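The command above can also be assembled from a script, for example to crawl several boards in turn. The following is a minimal sketch rather than part of the project: the helper names `build_crawl_command` and `run_crawl` are hypothetical, and it assumes the `scrapy` executable is on the PATH and that the spider is named `ptt` as in the command above.

```python
import subprocess


def build_crawl_command(board, pages=1):
    """Return the argument list for `scrapy crawl ptt` (hypothetical helper)."""
    pages = int(pages)
    if pages < 1:
        raise ValueError("pages must be at least 1")
    # Each -a option passes one spider argument as name=value.
    return [
        "scrapy", "crawl", "ptt",
        "-a", f"board={board}",
        "-a", f"pages={pages}",
    ]


def run_crawl(board, pages=1, project_dir="."):
    """Run the crawl from the project folder (assumed to be project_dir)."""
    # check=True raises CalledProcessError if scrapy exits with an error.
    subprocess.run(build_crawl_command(board, pages), cwd=project_dir, check=True)
```

Calling `run_crawl("EZSoft", 2)` from the project folder would then be equivalent to the shell command shown above.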
Class
PttCrawlerMongoDB
PttCrawlerMongoJson
Other functions
- timer spider
MongoDB Port Settings
- ptt_crawl/spiders/settings.py
ITEM_PIPELINES = {
    'ptt_crawl.pipelines.MongoDBPipeline': 300,
}
MONGODB_HOST = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "ptt"
MONGODB_DOC = "ptt"
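For orientation, a pipeline that consumes these settings might look roughly like the sketch below. This is an illustration under assumptions, not the project's actual `MongoDBPipeline`: it assumes `pymongo` is installed, that `MONGODB_DOC` names the collection, and that every item is stored as one document. The `client_factory` parameter is a hypothetical addition that lets the class be exercised without a running server.

```python
class MongoDBPipeline:
    """Sketch of a Scrapy item pipeline that writes items to MongoDB."""

    def __init__(self, host, port, db_name, collection_name, client_factory=None):
        self.host = host
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name
        # Hypothetical hook: defaults to pymongo.MongoClient in open_spider.
        self._client_factory = client_factory
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the crawler, which exposes the settings file.
        settings = crawler.settings
        return cls(
            settings.get("MONGODB_HOST"),
            settings.getint("MONGODB_PORT"),
            settings.get("MONGODB_DB"),
            settings.get("MONGODB_DOC"),
        )

    def open_spider(self, spider):
        if self._client_factory is None:
            import pymongo  # imported lazily so the class loads without it
            self._client_factory = pymongo.MongoClient
        self.client = self._client_factory(self.host, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store a plain-dict copy and pass the item on to later pipelines.
        self.collection.insert_one(dict(item))
        return item
```

The value 300 in `ITEM_PIPELINES` is the pipeline's order: Scrapy runs pipelines from lower to higher values, conventionally in the 0 to 1000 range.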
Log
V0.2 2019.2.10
- Rewrite the XPath expressions.
- Use a pipeline to save data into MongoDB.
- Fix bug: not all of the text in an article was crawled.
- Performance improvement: crawl time is reduced.
V0.1 2019.2.1
- Initial version.