vfulco / WeiboSpider

Sina Weibo Scraper by Yong HU and Mingyang LI. 2,000 posts per second.


WeiboSpider


This is a Sina Weibo spider built by nghuyong, largely tailored by Mingyang Li to run on WWBP's servers.

A detailed explanation, written by nghuyong, can be found at 微博爬虫总结:构建单机千万级别的微博爬虫系统 (A Summary of Weibo Crawling: Building a Single-Machine, Ten-Million-Scale Weibo Spider System).

A description of the data structure can be found at 数据字段说明与示例 (Data Field Descriptions and Examples).

Other Branches

The original repo by nghuyong has 3 branches:

Branch   Structure          Posts per Day
simple   single account     100,000
master   account pool       1,000,000
senior   distributed pool   10,000,000

Usage

  1. Clone the repo and install dependencies.
    git clone git@github.com:nghuyong/WeiboSpider.git
    cd WeiboSpider
    pip install -r requirements.txt
  2. Install PhantomJS, MongoDB, and Redis. Start the latter two.
  3. Put the usernames and passwords of some Sina Weibo accounts in sina/account_build/account.txt, following the format indicated in account_sample.txt.
  4. Populate the account pool by running python sina/account_build/login.py.
  5. Populate URLs to start scraping with by issuing python sina/redis_init.py.
  6. Run the scraper with scrapy crawl weibo_spider.
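The steps above can be condensed into a single setup script. This is a sketch under the assumptions stated in the list (service names vary by platform, and account.txt must already be filled in):

```shell
#!/usr/bin/env bash
set -euo pipefail

# 1. Clone the repo and install Python dependencies.
git clone git@github.com:nghuyong/WeiboSpider.git
cd WeiboSpider
pip install -r requirements.txt

# 2. PhantomJS, MongoDB, and Redis must already be installed;
#    start the latter two (service names are platform-dependent).
sudo service mongodb start
sudo service redis-server start

# 3. sina/account_build/account.txt must already list Weibo credentials
#    in the format shown in account_sample.txt.

# 4. Populate the account pool.
python sina/account_build/login.py

# 5. Seed Redis with the starting URLs.
python sina/redis_init.py

# 6. Launch the crawler.
scrapy crawl weibo_spider
```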

Data Storage

Posts, user profiles, user relationships, and (optionally) comments are stored in MongoDB.
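As a quick sanity check after a crawl, the stored documents can be inspected with pymongo. This is only a sketch: the database name ("Sina"), collection name ("Tweets"), and field name ("user_id") below are assumptions, not taken from this repo — check the Scrapy item pipeline settings for the real names.

```python
def posts_by_user(user_id):
    """Build a MongoDB filter for one user's posts.

    NOTE: the "user_id" field name is an assumption; verify it against
    the repo's item pipeline. Kept as a pure helper so it is testable
    without a running MongoDB instance.
    """
    return {"user_id": str(user_id)}


if __name__ == "__main__":
    # Imported here so the helper above works even without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)      # default local MongoDB
    db = client["Sina"]                           # assumed database name
    print(db["Tweets"].count_documents(posts_by_user(123456)))
```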

Performance

With the default settings, on a machine with 16 GB of memory, an 8-core CPU, Ubuntu, and 36 processes, we average 2,000 posts per second.


