shell909090 / siren

Spider framework and utilities written in Python

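Introduction

siren is a configuration-driven crawler system whose configuration and parsing rules are written in YAML. With YAML syntax, you can define a crawler easily, without writing large amounts of code.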

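Background knowledge

To use siren, you need to know CSS selectors or XPath well enough to describe the content you want to extract. You should also know regular expressions well enough to do simple filtering and substitution.

To use siren well, you may also need to understand the robots.txt protocol. Respect the site owner's wishes, fetch data politely, and be a gentleman (a.k.a. pervert) of a crawler.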
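
As an illustration of the robots.txt courtesy described above, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not part of siren. The robots.txt content, the user agent name `siren-example`, and the URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_checker(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a checker object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = build_checker(ROBOTS_TXT)

# Ask before fetching: is this URL allowed for our (made-up) user agent?
allowed = checker.can_fetch("siren-example", "http://example.com/book/1.htm")
blocked = checker.can_fetch("siren-example", "http://example.com/private/1.htm")

# Seconds the site asks crawlers to wait between requests, if it says so.
delay = checker.crawl_delay("siren-example")
```

A crawler could use the `Crawl-delay` value as a lower bound for its own request interval.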

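How it works

siren maintains a crawl queue. While the crawler is running, it takes one request off the queue at a time and checks it against the matching rules.

When a matching rule hits, the crawler executes an action: for example, downloading the URL and calling Python code to process it, or parsing the downloaded HTML and then calling Python code.

What sets siren apart is a set of predefined crawler handlers, called parsers. Through configuration alone, they can process results directly, with no Python code to write.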
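
The queue, match, and action cycle can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not siren's actual code: the rule table, the action functions, and the URLs are all hypothetical, and no network access takes place.

```python
import re
from collections import deque


def download_and_parse(url, queue, results):
    # Stand-in for "download the URL, parse the HTML, call Python code".
    results.append(("parsed", url))
    # An action may discover new links and put them on the queue.
    if url.endswith("index.htm"):
        queue.append("http://example.com/1.htm")


def ignore(url, queue, results):
    # Fallback action for requests no other rule wants.
    results.append(("ignored", url))


# Hypothetical rule table: the first pattern that matches decides the action.
RULES = [
    (re.compile(r"^http://example\.com/.*\.htm$"), download_and_parse),
    (re.compile(r".*"), ignore),
]


def run(start_urls):
    """Take requests off the queue one at a time and dispatch on the first hit."""
    queue = deque(start_urls)
    results = []
    while queue:
        url = queue.popleft()
        for pattern, action in RULES:
            if pattern.search(url):
                action(url, queue, results)
                break
    return results


results = run(["http://example.com/index.htm", "http://other.example/x.png"])
```

In this sketch a predefined parser would simply be another action that is selected by configuration instead of being written by hand.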

Example

name: wenku8
timeout: 10
interval: 5
result: novel:result
output: output.txt
patterns:

  - name: main
    desc: table of content
    parsers:
      - css: a
        attr: href
        is: "[0-9]+\\.htm"
        call: node

  - name: node
    desc: node
    parsers:
      - css: div#title
        text: yes
        result: title
      - css: div#content
        html2text: yes
        result: content
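
To make the `main` pattern above more concrete, the sketch below imitates its first parser with the Python standard library only: collect the `href` of every `a` element, then keep the values that match the regular expression. It is an approximation for illustration, not siren's implementation. In particular, it assumes that `is` matches from the start of the value, and the sample HTML is invented.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href of every <a> element (the `css: a` + `attr: href` steps)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def links_for_node(html, pattern=r"[0-9]+\.htm"):
    """Return the hrefs that would be handed on to the `node` pattern."""
    collector = HrefCollector()
    collector.feed(html)
    rule = re.compile(pattern)
    # Assumption: `is` keeps a value when the regex matches from its start.
    return [href for href in collector.hrefs if rule.match(href)]


# Invented sample page: two chapter links and one navigation link.
SAMPLE = (
    '<a href="1.htm">Chapter 1</a>'
    '<a href="index.htm">Home</a>'
    '<a href="22.htm">Chapter 22</a>'
)
links = links_for_node(SAMPLE)
```

Note that the YAML value `"[0-9]+\\.htm"` is a double-quoted string, so the regular expression the parser receives is `[0-9]+\.htm`.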

Configuration

See config for details.

Getting started

See guide.

TODO

  • do something

    • bilibili
    • bt.ktxp.com
    • jd
  • regex

  • js runner

  • store cookies in redis: faster reads and writes.

  • queue loop prevention (in redis): keep a list of URLs that have already been crawled.

  • parser in css or xpath
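
The loop-prevention item above could look roughly like the sketch below. It is only an illustration of the idea: the `DedupQueue` class is hypothetical, and an in-memory `set` stands in for the redis-backed list that the TODO item has in mind.

```python
from collections import deque


class DedupQueue:
    """A crawl queue that refuses URLs it has already accepted once."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()  # stand-in for a shared store such as redis

    def push(self, url):
        """Enqueue url unless it was seen before; return True if enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


q = DedupQueue()
accepted = [q.push(u) for u in ["/a.htm", "/b.htm", "/a.htm"]]
```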

License

Copyright (C) 2012 Shell Xu

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.
