ptt-crawler

This project crawls post content and comments from the PTT website and asynchronously applies the neural CKIP Chinese NLP tools to the scraped data.

Documentation

1. Installation

Python version
- python == 3.7.5

Clone repository

git clone git@github.com:Retr0327/ptt-crawler.git

Install requirements

cd scraptt && pip install -r requirement.txt

2. Usage

Commands

scrapy crawl <spider-name> -a boards=BOARDS [-a all=True] 
            [-a index_from=NUMBER -a index_to=NUMBER]   
            [-a since=YEAR] [-a data_dir=PATH]


required arguments:
<spider-name>           the name of the PTT spider to run (boards, ptt_post, or ptt_post_segmentation)
-a boards=BOARDS        the PTT board(s) to crawl

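Crawl all the posts of a board: add -a all=True
Crawl all the posts of a board from a year in the past: add -a since=YEAR
Crawl the posts of a board based on HTML indexes: add -a index_from=NUMBER -a index_to=NUMBER
Crawl the posts of multiple boards: name every board in -a boards=BOARDS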

If you want to save the (segmented) post data, add the data_dir argument, for example -a data_dir=./ptt_data, to the command.
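
The command forms described above can also be assembled from a script. The following is a minimal sketch, not part of this project: `build_crawl_command` is a hypothetical helper, the board name `Gossiping` and the year are only examples, and joining multiple boards with a comma is an assumption that should be checked against the spider code. It only uses the arguments shown in the synopsis above.

```python
from typing import List, Optional, Sequence


def build_crawl_command(
    spider: str,
    boards: Sequence[str],
    crawl_all: bool = False,
    index_from: Optional[int] = None,
    index_to: Optional[int] = None,
    since: Optional[int] = None,
    data_dir: Optional[str] = None,
    board_separator: str = ",",  # ASSUMPTION: the README does not state the separator
) -> List[str]:
    """Build the argument list for `scrapy crawl` (hypothetical helper)."""
    if not boards:
        raise ValueError("at least one board is required")
    # The synopsis groups index_from and index_to, so require both or neither.
    if (index_from is None) != (index_to is None):
        raise ValueError("index_from and index_to must be given together")

    cmd = ["scrapy", "crawl", spider, "-a", "boards=" + board_separator.join(boards)]
    if crawl_all:
        cmd += ["-a", "all=True"]
    if index_from is not None:
        cmd += ["-a", f"index_from={index_from}", "-a", f"index_to={index_to}"]
    if since is not None:
        cmd += ["-a", f"since={since}"]
    if data_dir is not None:
        cmd += ["-a", f"data_dir={data_dir}"]
    return cmd


if __name__ == "__main__":
    # Example values only: board name, year, and output directory are illustrative.
    command = build_crawl_command(
        "ptt_post", ["Gossiping"], since=2020, data_dir="./ptt_data"
    )
    print(" ".join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the directory that contains the Scrapy project. Building a list rather than a single string avoids shell quoting issues.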

Contact

If you have any suggestions or questions, please do not hesitate to email me at philcoke35@gmail.com.

License

Apache License 2.0