chenjr0719 / PTT-Crawler

A Python Crawler Implement for PTT

PTT-Crawler

A Python crawler implementation for PTT, with multiprocessing support.

What is PTT?

PTT is the largest BBS (bulletin board system) site in Taiwan.

It is also a good source of data: the collected posts can be used for analysis such as text mining and topic modeling.

Requirements

PTT-Crawler is written in Python 3 and uses BeautifulSoup4, requests, and html.parser to gather posts from PTT, then stores those posts in JSON files.

Make sure BeautifulSoup4 and requests are already installed. If they are not, you can install them with pip:

pip install requests
pip install BeautifulSoup4

How to Use?

You need to decide which board to crawl and how many index pages you want to gather.

Run the following command in a terminal:

python PTT_Crawler.py $BOARD $INDEX_NUM

For example:

python PTT_Crawler.py Gossiping 2
python PTT_Crawler.py Gossiping 2 -p no
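
As a rough illustration of what an "index page" is: PTT's web interface lists a board's posts on numbered pages with URLs of the form https://www.ptt.cc/bbs/BOARD/indexN.html. The sketch below builds the URLs for the most recent pages. The function name `index_page_urls` and the `latest_index` parameter are hypothetical, and it is an assumption that the crawler walks backwards from the newest page; the real script may do this differently.

```python
PTT_BASE_URL = "https://www.ptt.cc/bbs"


def index_page_urls(board, latest_index, index_num):
    """Return URLs for the `index_num` most recent index pages of `board`.

    `latest_index` is the number of the newest index page. How the real
    crawler discovers that number is not shown here (assumption).
    """
    # Never go below page 1, even if more pages are requested than exist.
    first = max(latest_index - index_num + 1, 1)
    return [
        f"{PTT_BASE_URL}/{board}/index{n}.html"
        for n in range(latest_index, first - 1, -1)
    ]


# Placeholder value: pretend the newest index page is number 100.
urls = index_page_urls("Gossiping", latest_index=100, index_num=2)
```

With these placeholder numbers, `urls` holds the two newest index pages of the Gossiping board, newest first.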

Arguments

  • -p, --push Set this argument to no to skip collecting pushes (the short replies under a post). Default is yes.
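
A command line like the one described above could be parsed with the standard `argparse` module. This is only a minimal sketch: the destination names `board` and `index_num`, and the helper `build_parser`, are assumptions and may not match what PTT_Crawler.py actually does.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command line (sketch only)."""
    parser = argparse.ArgumentParser(description="Sketch of the PTT-Crawler CLI")
    parser.add_argument("board", help="board name, e.g. Gossiping")
    parser.add_argument("index_num", type=int,
                        help="number of index pages to gather")
    parser.add_argument("-p", "--push", choices=["yes", "no"], default="yes",
                        help="set to 'no' to skip collecting pushes")
    return parser


# Equivalent to: python PTT_Crawler.py Gossiping 2 -p no
args = build_parser().parse_args(["Gossiping", "2", "-p", "no"])
collect_pushes = args.push == "yes"
```

Restricting `--push` to `yes` and `no` via `choices` makes a typo such as `-p n` fail early with a usage message instead of being silently treated as one of the two values.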

Output Data Format

In the .json file, an article looks like:

article = {
  'Board': board,
  'Article_Title': title,
  'Article_ID': article_id,
  'Author': author,
  'Time': publish_time,
  'Push_num': push_count,
  'Bad_num': bad_count,
  'Arrow_num': arrow_count,
  'Content': content
}
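
To show how a record of this shape can be turned into JSON, here is a small sketch using the standard `json` module. All field values are made-up placeholders, and wrapping the articles in a list is an assumption about the file layout. Passing `ensure_ascii=False` keeps Chinese text readable in the output rather than escaping it as `\uXXXX` sequences.

```python
import json

# Placeholder values only; real values come from the crawled post.
article = {
    "Board": "Gossiping",
    "Article_Title": "[問卦] example title",
    "Article_ID": "M.1234567890.A.123",
    "Author": "example_user",
    "Time": "Mon Jan  1 00:00:00 2018",
    "Push_num": 2,
    "Bad_num": 0,
    "Arrow_num": 1,
    "Content": "example content",
}

# Serialize a list of articles; json.dump(obj, file_object, ...) takes the
# same keyword arguments when writing straight to an open file.
text = json.dumps([article], ensure_ascii=False, indent=2)

# Round-trip check: loading the text gives back the same records.
loaded = json.loads(text)
```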

And a push looks like:

push = {
  'Tag': push_tag,
  'User': push_user,
  'Time': push_time,
  'Content': push_content,
  'ID': article_id + '_' + str(push_id)
}
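
The sketch below shows how push records of this shape relate to the per-article counters. On PTT a push line starts with one of three tags: 推 (upvote), 噓 (downvote), or → (neutral comment). Mapping those tags to `Push_num`, `Bad_num`, and `Arrow_num` is an assumption based on the field names, as are the helper names `make_push` and `count_tags` and the choice to start `push_id` at 1.

```python
from collections import Counter

# Assumed mapping from PTT push tags to the article counter fields.
TAG_TO_FIELD = {"推": "Push_num", "噓": "Bad_num", "→": "Arrow_num"}


def make_push(article_id, push_id, tag, user, time, content):
    """Build one push record in the documented format."""
    return {
        "Tag": tag,
        "User": user,
        "Time": time,
        "Content": content,
        "ID": article_id + "_" + str(push_id),
    }


def count_tags(pushes):
    """Count pushes per counter field; unknown tags are ignored."""
    counts = Counter(TAG_TO_FIELD.get(push["Tag"]) for push in pushes)
    return {field: counts.get(field, 0) for field in TAG_TO_FIELD.values()}


# Placeholder data: (tag, user, time, content) for three pushes.
raw = [
    ("推", "user_a", "01/01 00:01", "nice"),
    ("→", "user_b", "01/01 00:02", "hmm"),
    ("推", "user_c", "01/01 00:03", "agreed"),
]
pushes = [
    make_push("M.1234567890.A.123", push_id, *fields)
    for push_id, fields in enumerate(raw, start=1)
]
counters = count_tags(pushes)
```

Because each push ID embeds the article ID, pushes stored separately can still be joined back to their article.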

About

License: MIT License

