
Traditional Chinese README.md (繁體中文)



PTT Crawler

Uses requests, PyQuery, pandas, and SQLite to build a crawler that crawls the PTT website, saves the crawled data into a SQLite database, and connects to LINE Notify for notifications.

Table of Contents
  1. About
  2. Getting Started
  3. Usage
  4. License
  5. Contact
  6. Acknowledgements

About

PTT is one of the most popular social media platforms in Taiwan.

Because the volume of new posts each day is too large to digest completely, a crawler lets us collect the data quickly.

In addition, storing the crawled data in a database enables follow-up analysis, such as machine learning, deep learning, or public opinion analysis.
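Because everything lands in SQLite, the data can be pulled straight back into pandas for that kind of analysis. A minimal sketch (the table name NBA is an assumption, e.g. one table per crawled board; the database path mirrors the Usage example below):

    # Sketch: load crawled posts back from SQLite for analysis.
    # The table name 'NBA' is an assumption (e.g. one table per board).
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect(r'D:\ptt_test.db')  # sqlite_path from config.ini
    df = pd.read_sql('SELECT * FROM NBA', conn)
    conn.close()
    print(df.head())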


Built With

  • requests
  • PyQuery
  • pandas
  • SQLite
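These pieces fit together in the usual way: requests fetches a board's index page (PTT's age-gated boards require an over18 cookie) and PyQuery extracts the post list. A minimal, self-contained sketch of the idea (illustrative only; these helper names are not this repo's API):

    # Illustrative sketch of the requests + PyQuery crawling pattern;
    # the function names here are not the repo's API.
    import requests
    from pyquery import PyQuery as pq

    def fetch_board_index(board):
        """Fetch the HTML of a PTT board's newest index page."""
        url = f'https://www.ptt.cc/bbs/{board}/index.html'
        # Age-gated boards (e.g. Gossiping) require the 'over18' cookie.
        resp = requests.get(url, cookies={'over18': '1'}, timeout=10)
        resp.raise_for_status()
        return resp.text

    def parse_titles(html):
        """Extract the title and link of each post on an index page."""
        doc = pq(html)
        return [
            {'title': a.text(), 'url': 'https://www.ptt.cc' + a.attr('href')}
            for a in doc('div.r-ent div.title a').items()
        ]

    for post in parse_titles(fetch_board_index('NBA')):
        print(post['title'], post['url'])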

Getting Started

Installation

  1. Clone the repo

    git clone https://github.com/DysonMa/PTT-Crawler.git
    
  2. Edit config.ini (a sketch of the file follows this list)

    boardlist: list of the PTT board names to crawl

    deadline: the date at which crawling by date stops

    sqlite_path: path of the SQLite database that stores the crawled data

    token: LINE Notify service token
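For reference, a config.ini along these lines should work (a sketch: the [config] section name is an assumption, the values mirror the Usage example below, and updatePageNum, which is also printed there, is assumed to live in the same file):

    ; Sketch of config.ini; the [config] section name is an assumption.
    [config]
    boardlist = Civil,Soft_Job,NBA
    deadline = 2020-12-19 00:00:00
    updatePageNum = 1
    sqlite_path = D:\ptt_test.db
    token = <your LINE Notify token>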

Usage

First, create config.ini with the required parameters and save it in the same directory as main.ipynb.

Below is a simple example:

  • Import the ptt package
from ptt.crawler import * 
from ptt.schedule import *
  • Check parameters
print('config_path:', config_path)
print('deadline:', deadline)
print('boardlist:', boardlist)
print('updatePageNum:', updatePageNum)
print('sqlite_path:', sqlite_path)

config_path: config.ini
deadline: 2020-12-19 00:00:00
boardlist: ['Civil', 'Soft_Job', 'NBA']
updatePageNum: 1
sqlite_path: D:\ptt_test.db

  • Build the website variable from a specific board name
website = get_index('civil')
print(get_weburl(website))

https://www.ptt.cc//bbs/civil/index.html

  • Crawl the PTT website by Page
df = CrawlingByPage(website, page=2, save=True, update=True)


  • Crawl the PTT website by Date
df = CrawlingByDate(website, deadline, save=True, update=True)


  • Regularly crawl the PTT website by Schedule
schedule()
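schedule() handles the timing internally; the general idea of periodic crawling can be sketched with the third-party schedule library (illustrative only; the repo's own ptt.schedule may work differently):

    # Illustrative periodic-crawl loop using the third-party 'schedule'
    # library (pip install schedule); not the repo's implementation.
    import time
    import schedule
    from ptt.crawler import get_index, CrawlingByPage  # names as used above

    def job():
        website = get_index('civil')
        CrawlingByPage(website, page=1, save=True, update=True)

    schedule.every(30).minutes.do(job)  # the 30-minute interval is an assumption
    while True:
        schedule.run_pending()
        time.sleep(1)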


LINE Notification

The crawler pushes notifications through LINE Notify, using the token set in config.ini.
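LINE Notify itself is a simple HTTP API: a POST to its notify endpoint with the token as a Bearer header. A minimal sketch:

    # Sketch: push a message through the LINE Notify HTTP API.
    import requests

    def line_notify(token, message):
        resp = requests.post(
            'https://notify-api.line.me/api/notify',
            headers={'Authorization': 'Bearer ' + token},
            data={'message': message},
        )
        return resp.status_code  # 200 on success

    line_notify('<your LINE Notify token>', 'PTT crawl finished')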

License

Distributed under the MIT License.

Contact

Dyson Ma - Gmail

Project Link: https://github.com/DysonMa/PTT-Crawler

Acknowledgements
