This tool parses raw crawler data files. Features:
- Extract the main text from raw HTML
- Segment the extracted text into words
- Store the parsed results in a MongoDB database
- Current data sources: blogs, news, and Weibo
- `python main.py` — parse the raw txt files (excluding Weibo data).
- `python parse_weibo.py` — parse the Weibo data in the `weibo_data` collection and store the results in the `weibo` collection.
- `python parse_bbs.py` — parse the BBS data in the `bbs_data` collection and store the results in the `bbs` collection.
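The scripts above each write parsed records into a MongoDB collection. As a rough illustration of what one such record might look like, here is a minimal sketch; the field names, helper function, and the commented pymongo call are assumptions for illustration only, not the project's actual schema:

```python
# Hypothetical shape of one parsed document -- field names here are
# illustrative assumptions, not the project's actual schema.
import json


def build_document(url, content, tokens, source):
    """Assemble one parsed record; tokens would come from jieba in the real pipeline."""
    return {
        "url": url,
        "content": content,   # main text extracted from the raw HTML
        "tokens": tokens,     # word-segmented content
        "source": source,     # e.g. "blog", "news", "weibo", "bbs"
    }


doc = build_document(
    "http://example.com/post/1",
    "解析出来的正文",
    ["解析", "出来", "的", "正文"],  # e.g. jieba.lcut(content)
    "weibo",
)
print(json.dumps(doc, ensure_ascii=False))
# In the real pipeline the record would then be inserted with pymongo,
# something like (hypothetical database/collection names):
# from pymongo import MongoClient
# MongoClient(host, port).crawler.weibo.insert_one(doc)
```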
- Anaconda
- MongoDB
- Python 2.7
- BeautifulSoup
- jieba
- pymongo
- `git clone https://github.com/KeithYue/WebdataPipeline.git` into your workspace and `cd WebdataPipeline`.
- Edit the config.ini file to set the ip:port of the MongoDB instance and the directories where the raw data is stored; each kind of data is mapped to its own directory.
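config.ini itself is not shown in this README; the layout below is only a guess at what such a file might contain, based on the description above (all section and key names are assumptions — check the config.ini shipped with the repository for the real ones):

```ini
; Hypothetical layout -- section and key names are assumptions.
[mongodb]
host = 127.0.0.1
port = 27017

[data]
blog = /data/raw/blog
news = /data/raw/news
weibo = /data/raw/weibo
bbs = /data/raw/bbs
```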
- Run `python main.py`. The program will use `cpu_count() - 2` of the cores on the current machine.
- Example of output:
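The worker-pool behaviour described above (leaving two cores free) can be sketched with the standard library; this is a minimal sketch, and `parse_file` is only a stand-in for the project's actual parsing functions:

```python
import multiprocessing


def parse_file(path):
    # Stand-in for the real work: extract text, segment it, store it.
    return "parsed:" + path


def run(paths):
    # Use cpu_count() - 2 workers as the README describes,
    # but never drop below one worker on small machines.
    workers = max(1, multiprocessing.cpu_count() - 2)
    pool = multiprocessing.Pool(processes=workers)
    try:
        return pool.map(parse_file, paths)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(run(["a.txt", "b.txt"]))
```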