KeithYue / WebdataPipeline

Constructs pipelines for different forms of web data such as Weibo, BBS, news, and blogs, including a spider, content extraction, and tokenization.

WebdataPipeline

This tool parses raw crawler data files. Its functions include:

  • extracting the main text from raw HTML
  • tokenizing the extracted text
  • storing the parsed results in a MongoDB database
  • current data sources: blogs, news, and Weibo
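A minimal sketch of this extract-tokenize-store flow, assuming only the dependencies listed later in this README (BeautifulSoup, jieba, pymongo); the function, database, and collection names are illustrative, not taken from the repository:

    # -*- coding: utf-8 -*-
    # Illustrative extract -> tokenize -> store flow (Python 2.7).
    # Function, database, and collection names are hypothetical.
    from bs4 import BeautifulSoup
    import jieba
    from pymongo import MongoClient

    def extract_text(raw_html):
        # Drop script/style tags, then pull out the visible body text.
        soup = BeautifulSoup(raw_html, 'html.parser')
        for tag in soup(['script', 'style']):
            tag.extract()
        return soup.get_text(separator=' ', strip=True)

    def tokenize(text):
        # jieba.cut yields Chinese word segments lazily.
        return list(jieba.cut(text))

    if __name__ == '__main__':
        client = MongoClient('localhost', 27017)  # host/port would come from config.ini
        coll = client['webdata']['news']          # assumed database/collection names
        raw = u'<html><body><p>一段示例正文</p></body></html>'
        text = extract_text(raw)
        coll.insert_one({'content': text, 'tokens': tokenize(text)})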

How to Use

  1. python main.py parses the raw txt files (excluding Weibo data).
  2. python parse_weibo.py parses the Weibo data in the weibo_data collection and stores the results in the weibo collection.
  3. python parse_bbs.py parses the BBS data in the bbs_data collection and stores the results in the bbs collection.
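Each of these parsers reads raw documents from one collection and writes parsed results into another. A hedged sketch of that pattern for the Weibo case follows; the 'text' field and the database name are assumptions about the schema, not taken from the repository:

    # -*- coding: utf-8 -*-
    # Sketch of the weibo_data -> weibo parsing pass; field and database
    # names are assumptions about the schema.
    import jieba
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    db = client['webdata']                   # assumed database name
    for raw in db['weibo_data'].find():
        text = raw.get('text', u'')          # assumed field holding the raw post
        db['weibo'].insert_one({
            'source_id': raw['_id'],         # keep a back-reference to the raw doc
            'content': text,
            'tokens': list(jieba.cut(text)),
        })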

Dependencies

Environment

  • Anaconda
  • MongoDB
  • Python 2.7

Python modules

  • BeautifulSoup
  • jieba
  • pymongo

Usage

  1. git clone https://github.com/KeithYue/WebdataPipeline.git into your workspace and cd WebdataPipeline.
  2. Edit the config.ini file to set the ip:port of the MongoDB instance and the locations of the raw data; each kind of data is mapped to its own directory (a hypothetical layout is sketched after the example below).
  3. Run python main.py. The program uses cpu_count - 2 of the machine's cores.
  4. Example output:

(example output screenshot)
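The README does not show config.ini's keys or main.py's internals, so the following is only one plausible reading of steps 2 and 3: read a hypothetical [mongodb] section and a [data_dirs] section with ConfigParser, then fan the input files out to a pool of cpu_count - 2 worker processes. Every section, key, and function name here is an assumption.

    # -*- coding: utf-8 -*-
    # Plausible shape of the main.py driver (Python 2.7); all names are assumptions.
    import glob
    import multiprocessing
    from ConfigParser import ConfigParser

    def parse_file(path):
        # Placeholder for the real per-file work: extract, tokenize, store.
        return path

    if __name__ == '__main__':
        config = ConfigParser()
        config.read('config.ini')
        host = config.get('mongodb', 'host')        # hypothetical [mongodb] section
        port = config.getint('mongodb', 'port')     # would be handed to MongoClient
        blog_dir = config.get('data_dirs', 'blog')  # hypothetical raw-data mapping

        # Step 3 above: the program uses cpu_count - 2 worker processes.
        workers = max(1, multiprocessing.cpu_count() - 2)
        pool = multiprocessing.Pool(processes=workers)
        pool.map(parse_file, glob.glob(blog_dir + '/*.txt'))
        pool.close()
        pool.join()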
