konhay / weibo-spider

A crawler for the mobile site of Sina Weibo, a popular Chinese social media platform. It is often used to build unstructured text and image datasets.

weibo-spider

Introduction

This is a crawler for the Sina Weibo mobile site. Weibo is the most popular social media platform in mainland China. We clean and organize the crawled data, which can then be used to generate word-cloud figures.

Code Structure

scrapy startproject [yourproject] will create a scrapy project.

scrapy.cfg is the configuration file for the project.

settings.py sets the request parameters, the proxy configuration, and how the crawled data is saved to file.
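As a rough illustration of what such a settings file might contain, here is a minimal sketch. The setting names in upper case (other than `PROXY_LIST`) are standard Scrapy settings; the project name `weibo_spider`, the class paths, the output path, and the custom `PROXY_LIST` setting are assumptions made for this example and are not taken from the repository.

```python
# Hypothetical settings fragment; "weibo_spider" and the class paths are assumed names.
BOT_NAME = "weibo_spider"

# Request parameters: pace the crawl and keep cookies enabled for logged-in pages.
DOWNLOAD_DELAY = 2           # seconds between consecutive requests
CONCURRENT_REQUESTS = 8      # upper bound on parallel requests
COOKIES_ENABLED = True

# Custom middleware that rotates User-Agent strings, cookies, and proxies.
DOWNLOADER_MIDDLEWARES = {
    "weibo_spider.middlewares.RotatingRequestMiddleware": 400,
}

# PROXY_LIST is NOT a built-in Scrapy setting; a project middleware would read it.
PROXY_LIST = [
    "http://127.0.0.1:8080",
]

# Where items go after extraction: a pipeline, plus an optional file export
# (the FEEDS setting requires Scrapy 2.1 or later).
ITEM_PIPELINES = {
    "weibo_spider.pipelines.MongoPipeline": 300,
}
FEEDS = {
    "output/weibo.jsonl": {"format": "jsonlines", "encoding": "utf8"},
}
```

Lower numbers in `DOWNLOADER_MIDDLEWARES` run closer to the engine, so a value such as 400 places the rotation before Scrapy's built-in cookie and proxy middlewares handle the request.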

/spider/sinaSpider.py is the main code of the crawler.

middlewares.py contains the middleware that processes Scrapy requests. It mainly rotates User-Agent strings, cookies, and proxies.
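The rotation idea can be sketched as a plain downloader middleware class. This is an illustrative sketch, not the repository's actual code: the class name `RotatingRequestMiddleware`, the sample User-Agent strings, and the constructor parameters are assumptions. It relies only on the `process_request(request, spider)` hook that Scrapy calls on downloader middlewares, and on `request.meta["proxy"]`, which Scrapy's built-in proxy middleware reads.

```python
import random

# Sample values for illustration only.
DEFAULT_USER_AGENTS = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15",
    "Mozilla/5.0 (Linux; Android 9; Pixel 3) AppleWebKit/537.36",
]


class RotatingRequestMiddleware:
    """Pick a random User-Agent, cookie set, and proxy for every request."""

    def __init__(self, user_agents=DEFAULT_USER_AGENTS, cookie_pool=(), proxies=()):
        self.user_agents = list(user_agents)
        self.cookie_pool = list(cookie_pool)   # list of {name: value} dicts
        self.proxies = list(proxies)           # list of proxy URLs

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        if self.cookie_pool:
            request.cookies = random.choice(self.cookie_pool)
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        # Returning None tells Scrapy to continue processing this request.
        return None
```

To take effect, a class like this would be registered under `DOWNLOADER_MIDDLEWARES` in the project settings with a priority that runs before Scrapy's cookie middleware.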

items.py defines the data structures to be extracted.
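For illustration, an item definition for a single Weibo post might look like the sketch below. The field names are hypothetical and not taken from the repository. The project itself most likely subclasses `scrapy.Item`; a standard-library dataclass is used here instead so the example runs without Scrapy installed (recent Scrapy versions also accept dataclass objects as items).

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class WeiboPostItem:
    """One crawled post; every field name here is an assumption."""
    post_id: str
    user_id: str
    content: str = ""
    created_at: Optional[str] = None      # timestamp string as shown on the page
    like_count: int = 0
    image_urls: List[str] = field(default_factory=list)


# Example: build an item and convert it to a plain dict for storage.
post = WeiboPostItem(post_id="p1", user_id="u1", content="hello")
post_as_dict = asdict(post)
```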

pipelines.py further processes the extracted items; the connection to MongoDB is also set up here.
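A pipeline of this kind could be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's implementation: the class name `MongoPipeline`, the connection URI, and the database and collection names are hypothetical, and the only cleaning step shown (stripping whitespace from string fields) is a stand-in for whatever processing the project performs. It uses the `open_spider`, `close_spider`, and `process_item` hooks that Scrapy calls on item pipelines, together with `pymongo.MongoClient` and `insert_one`.

```python
class MongoPipeline:
    """Clean each item lightly and store it in a MongoDB collection."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="weibo", collection_name="posts"):
        # All three defaults are assumptions for this sketch.
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # pymongo is a third-party package; it is imported lazily so the class
        # can be defined and unit-tested without a MongoDB installation.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Strip surrounding whitespace from string fields before storing.
        doc = {k: v.strip() if isinstance(v, str) else v
               for k, v in dict(item).items()}
        self.collection.insert_one(doc)
        return item  # hand the item on to any later pipeline
```

Because the collection is just an attribute, a test can replace it with any object that has an `insert_one` method and exercise `process_item` without a running database.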

Libraries

scrapy is an application framework for crawling website data and extracting structured data. It is a very powerful and easy-to-use crawler framework that not only provides some basic components out of the box, but also provides powerful customization capabilities.

selenium is a tool for testing Web applications. Selenium tests run directly in the browser, just as real users do. We use selenium mainly to simulate the behavior of users to log in to Weibo and get cookies.
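Once a Selenium-driven browser has logged in, its cookies need to be handed to the crawler. Selenium's `driver.get_cookies()` returns a list of dictionaries with `name` and `value` keys (among others), while a Scrapy request takes cookies as a simple name-to-value mapping. The helper below sketches that conversion; the function name `cookies_to_dict` and the sample cookie names are assumptions made for illustration.

```python
def cookies_to_dict(selenium_cookies):
    """Convert Selenium-style cookie records into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Sample records shaped like the output of driver.get_cookies().
sample_cookies = [
    {"name": "SUB", "value": "abc", "domain": ".weibo.cn", "path": "/"},
    {"name": "SSOLoginState", "value": "123", "domain": ".weibo.cn", "path": "/"},
]
cookie_mapping = cookies_to_dict(sample_cookies)
```

In a real run, the argument would be `driver.get_cookies()` taken after the login step has completed, and the resulting mapping could be added to the cookie pool that the middleware rotates.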

PhantomJS is a headless, scriptable WebKit browser. It natively supports several web standards: DOM manipulation, CSS selectors, JSON, Canvas, etc.

Reference

web_scraping_with_python
