zhupingqi / RuiJi.Net

crawler framework, distributed crawler extractor


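I'm getting older and can't pull all-nighters anymore, so I now sell tea. Fellow coders who enjoy tea, come and see how to brew a delicious cup: 泡泡茶 https://www.paopaocha.top/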

About RuiJi Scraper

RuiJi Scraper is a browser plug-in based on RuiJi expressions. It offers visual rule editing and generates RuiJi expressions for RuiJi.Net. It is available for Firefox.



We cannot withdraw donations from Open Collective, so we have had to shut down the demo and documentation servers of ruiji.net.

Support us

If you would like to restart the donation support project, please contact us by email at 416803633@qq.com.

About RuiJi.Net

RuiJi.Net is a distributed crawler framework written in .NET Core.

RuiJi.Net is a self-hosted Web API built with Microsoft.AspNetCore.Owin. Major features include a distributed crawler, a distributed extractor, and managed cookies.

RuiJi.Net supports IP polling using the server's public network addresses and proxy servers.
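
If "IP polling" is read as rotating outgoing requests across several addresses, the idea can be sketched as a round-robin picker. This is only an illustration, not RuiJi.Net's implementation: the `addresses` array (documentation-range IPs) and the `NextAddress` helper are made up for the example.

```csharp
using System;
using System.Threading;

// Hypothetical pool of outbound addresses (documentation-range IPs, not real servers).
var addresses = new[] { "192.0.2.10", "192.0.2.11", "192.0.2.12" };
var counter = -1;

// Returns the next address in round-robin order.
// Interlocked.Increment keeps the counter safe when called from several threads.
string NextAddress()
{
    var n = Interlocked.Increment(ref counter);
    return addresses[(int)((uint)n % (uint)addresses.Length)];
}

// Four picks from a pool of three wrap around to the first address.
var picks = new[] { NextAddress(), NextAddress(), NextAddress(), NextAddress() };
Console.WriteLine(string.Join(", ", picks));
```

The chosen address could then be assigned to `request.Ip` or used to build a `RequestProxy`, as in the crawl examples further down.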


Documentation (under construction): http://doc.ruijihg.com/




| Feature | Support |
| --- | --- |
| webheader | custom |
| method | get/post |
| auto redirection | support |
| cookie | managed/custom |
| service point ip | auto/custom bind |
| encoding | auto detect/specified |
| response | raw/string |
| proxy | http |
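
For readers more familiar with the standard .NET HTTP stack, several of these features have rough equivalents on `HttpClientHandler`. The sketch below is not RuiJi.Net code; the proxy address, user agent string and target URL are placeholders, and no request is actually sent.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text;

// Plain .NET handler configured with options comparable to the feature list.
var handler = new HttpClientHandler
{
    AllowAutoRedirect = true,                    // auto redirection
    UseCookies = true,
    CookieContainer = new CookieContainer(),     // managed cookies
    Proxy = new WebProxy("127.0.0.1", 3128),     // placeholder HTTP proxy
    UseProxy = true
};

using var client = new HttpClient(handler);

// Custom web header (placeholder user agent).
client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (compatible; ExampleBot/1.0)");

// A POST request with a form body; GET would use HttpMethod.Get and no content.
var message = new HttpRequestMessage(HttpMethod.Post, "https://example.com/search")
{
    Content = new StringContent("q=tea", Encoding.UTF8, "application/x-www-form-urlencoded")
};

Console.WriteLine($"{message.Method} {message.RequestUri}");
```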



Extract structure

[Figure: extract structure]


crawl using the local IP automatically

var crawler = new RuiJiCrawler();
var request = new Request("https://www.baidu.com");
var response = crawler.Request(request);

crawl with a specific IP

var crawler = new RuiJiCrawler();
var request = new Request("https://www.baidu.com");
request.Ip = "";
var response = crawler.Request(request);

crawl with proxy

var crawler = new RuiJiCrawler();
var request = new Request("https://www.baidu.com");
request.Proxy = new RequestProxy("", 3128);

var response = crawler.Request(request);

extract url

var crawler = new RuiJiCrawler();
var request = new Request("https://www.oschina.net/blog");

var response = crawler.Request(request);
var content = response.Data.ToString();

var parser = new RuiJiParser();
var eb = parser.ParseExtract("css a.blog-title-link[href]\nexp https://my.oschina.net/*/blog/*");
var result = RuiJiExtractor.Extract(content, eb.Block);
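
The `exp` line above filters the extracted links with a `*` wildcard pattern. How RuiJi.Net evaluates it is not shown here; a minimal sketch, assuming `*` stands for any run of characters, is to translate the pattern into an anchored regular expression. `WildcardMatch` is a hypothetical helper, not part of the library.

```csharp
using System;
using System.Text.RegularExpressions;

// Converts a '*' wildcard pattern into an anchored regular expression.
// Assumption: '*' means "any run of characters"; RuiJi.Net may define it differently.
static bool WildcardMatch(string pattern, string input)
{
    // Escape everything, then turn the escaped '*' back into ".*".
    var regex = "^" + Regex.Escape(pattern).Replace("\\*", ".*") + "$";
    return Regex.IsMatch(input, regex);
}

var pattern = "https://my.oschina.net/*/blog/*";

var blogUrl = WildcardMatch(pattern, "https://my.oschina.net/zhupingqi/blog/1826317");
var otherUrl = WildcardMatch(pattern, "https://www.oschina.net/news/12345");

Console.WriteLine(blogUrl);   // the blog URL fits the pattern
Console.WriteLine(otherUrl);  // the news URL does not
```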

extract tile

var crawler = new RuiJiCrawler();
var request = new Request("http://www.ruijihg.com/archives/category/tech/bigdata");

var response = crawler.Request(request);
var content = response.Data.ToString();

var parser = new RuiJiParser();
// A verbatim string (@"...") does not process \n, so a real line break is used after [tile].
var eb = parser.ParseExtract(@"[tile]
css article:html

css .entry-header:text

css .entry-header + p:text
ex /Read more »/ -e");

var result = RuiJiExtractor.Extract(content, eb.Block);

extract meta

var crawler = new RuiJiCrawler();
var request = new Request("https://my.oschina.net/zhupingqi/blog/1826317");

var response = crawler.Request(request);
var content = response.Data.ToString();

var parser = new RuiJiParser();
var eb = parser.ParseExtract(@"[meta]
css h1.header:text

css div.blog-meta .avatar + span:text

css div.blog-meta > div.item:first:text
regS /发布于/ 1

css div.blog-meta > div.item:eq(1):text
regS / / 1

css #articleContent:html");

var result = RuiJiExtractor.Extract(content, eb.Block);
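
The `regS` lines above appear to split the selected text by a regular expression and keep one piece by index. That reading is an assumption, and the sketch below only illustrates it with the standard `Regex.Split`; `SplitAndPick` and the sample text are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical reading of "regS /pattern/ n": split by the regex, keep piece n.
static string SplitAndPick(string input, string pattern, int index)
{
    var parts = Regex.Split(input, pattern);
    return index >= 0 && index < parts.Length ? parts[index] : string.Empty;
}

// Made-up sample text, not taken from the real page.
var picked = SplitAndPick("发布于 2018/06/08", "发布于", 1).Trim();
Console.WriteLine(picked);
```

Because the pattern matches at the very start of the sample, piece 0 is an empty string and piece 1 holds the text after the match.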

detect MIME type

var crawler = new RuiJiCrawler();
var request = new Request("http://img10.jiuxian.com/2018/0111/cd51bb851410404388155b3ec2c505cf4.jpg");
var response = crawler.Request(request);

var ex = response.Extensions;
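
How RuiJi.Net fills `response.Extensions` is not described here. One common way to detect a file type without trusting the URL is to look at the leading "magic" bytes of the payload. The sketch below shows that technique with the well-known JPEG, PNG and GIF signatures; `GuessExtension` and `StartsWith` are hypothetical helpers.

```csharp
using System;

// Guess a file extension from the first bytes of a payload (magic numbers).
static string GuessExtension(byte[] data)
{
    if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return ".jpg";        // JPEG
    if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47)) return ".png";  // PNG
    if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return ".gif";  // GIF87a / GIF89a
    return string.Empty;                                          // unknown type
}

// True when data begins with the given signature bytes.
static bool StartsWith(byte[] data, params byte[] signature)
{
    if (data.Length < signature.Length) return false;
    for (var i = 0; i < signature.Length; i++)
    {
        if (data[i] != signature[i]) return false;
    }
    return true;
}

// First four bytes of a typical JPEG file.
var guessed = GuessExtension(new byte[] { 0xFF, 0xD8, 0xFF, 0xE0 });
Console.WriteLine(guessed);
```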

RuiJi.Net Cluster

  1. Download ZooKeeper from an Apache mirror: http://mirrors.hust.edu.cn/apache/zookeeper/zookeeper-3.4.12/

  2. In the conf folder, copy zoo_sample.cfg, rename the copy to zoo.cfg, and set dataDir to your own data directory.

  3. Confirm that a Java runtime environment is installed.

  4. Run bin/zkServer.cmd in your ZooKeeper folder.

  5. Start up ZooKeeper.

  6. Compile RuiJi.Net.Cmd and run RuiJi.Net.Cmd.exe.

If you see the following output,

Server Start At http://x.x.x.x:x
proxy x.x.x.x:x ready to startup!
try connect to zookeeper server : x.x.x.x:2181
zookeeper server connected!

the service has started successfully.
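
For step 2 of the ZooKeeper setup, a minimal standalone zoo.cfg might look like the fragment below. The values are the defaults shipped in zoo_sample.cfg for ZooKeeper 3.4.x, except for dataDir, which is a placeholder path to replace with your own directory. The client port matches the 2181 shown in the startup output.

```properties
# Basic time unit in milliseconds
tickTime=2000
initLimit=10
syncLimit=5
# Placeholder: replace with your own data directory
dataDir=C:/zookeeper/data
# Port that RuiJi.Net.Cmd connects to
clientPort=2181
```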

RuiJi.Net.Cmd.exe must be run as an administrator.

        // List<T> requires System.Collections.Generic.
        var request = new Request("http://www.ruijihg.com/%e5%bc%80%e5%8f%91/");

        var response = Crawler.Request(request);

        // Assumed: stop when the request fails.
        if (response.StatusCode != System.Net.HttpStatusCode.OK)
            return;

        var content = response.Data.ToString();

        // Block selector: the region of the page to extract from.
        var block = new ExtractBlock();
        block.Selectors = new List<ISelector>
        {
            new CssSelector(".entry-content", CssTypeEnum.InnerHtml)
        };

        // Tile selector: the repeated items inside the block.
        block.TileSelector = new ExtractTile
        {
            Selectors = new List<ISelector>
            {
                new CssSelector(".pt-cv-content-item", CssTypeEnum.InnerHtml)
            }
        };

        // Metadata extracted from each tile.
        block.TileSelector.Metas.AddMeta("title", new List<ISelector> {
            new CssSelector(".pt-cv-title")
        });

        block.TileSelector.Metas.AddMeta("url", new List<ISelector> {
            new CssSelector(".pt-cv-readmore", "href")
        });

        var r = Extractor.Extract(new ExtractRequest {
            Block = block,
            Content = content
        });
RuiJi Expression

RuiJi Expression is a quick way to define page extraction rules. The expressions are kept as simple and readable as possible. Before starting, it helps to understand the rule model of RuiJi.Net.

A RuiJi expression uses the structure described in the figure above to extract the required parts of a page. The unit of extraction is the Block, as shown in the following figure.

Selectors is a list of selectors.
Tiles is a region that needs to be extracted repeatedly.
Metas is the metadata that needs to be extracted.
Blocks are sub-blocks that need to be extracted within the Block.

[Figure: Block structure]

To extract http://www.ruijihg.com/开发, first examine the structure of the page. You can press F12 to inspect it.

[Figure: page structure viewed with F12]

First, make sure that the Block selector matches exactly one element.

[Figure: result of the Block selector]

The Block can be defined as follows:

css .pt-cv-view:ohtml

Next, add the tile:

    css .pt-cv-content-item:ohtml

    css .pt-cv-title:text

    css .pt-cv-content:html
    ex 阅读更多... -e

You may notice the \t indentation. Because both a block and a tile can contain meta, the tile selector and the tile meta lines are prefixed with \t to mark them as belonging to the current tile.
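
To make the role of the leading \t concrete, the sketch below separates expression lines into block-level and tile-level groups by checking for a leading tab. It only illustrates the indentation rule described above and is not the RuiJi.Net parser; `SplitByIndent` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: lines that start with a tab go to the tile group,
// all other non-empty lines go to the block group.
static (List<string> Block, List<string> Tile) SplitByIndent(string expression)
{
    var block = new List<string>();
    var tile = new List<string>();

    foreach (var raw in expression.Split('\n'))
    {
        var line = raw.TrimEnd('\r');
        if (line.Trim().Length == 0) continue;   // skip blank lines

        if (line.StartsWith('\t')) tile.Add(line.Trim());
        else block.Add(line.Trim());
    }

    return (block, tile);
}

var expression = "css .pt-cv-view:ohtml\n\n\tcss .pt-cv-content-item:ohtml\n\n\tcss .pt-cv-title:text";
var (blockLines, tileLines) = SplitByIndent(expression);

Console.WriteLine(string.Join(" | ", blockLines));
Console.WriteLine(string.Join(" | ", tileLines));
```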

The complete Block description structure is as follows



    tile selector





Admin UI


Please contact me with any suggestions.


My website: www.ruijihg.com

QQ discussion group: 545931923




License:GNU Lesser General Public License v3.0


Languages: C# 95.0%, CSS 5.0%, Prolog 0.0%, Standard ML 0.0%, Batchfile 0.0%, Shell 0.0%