h4de5ing / MSpider

Spider


MSpider: a web link crawler

Crawler features

1. Configurable number of threads
2. Configurable crawl depth
3. Configurable crawl count
4. Configurable crawl time
5. Configurable domain keywords (one or more)
6. Configurable focus keywords (one or more)
7. Configurable filter keywords (one or more)
8. URL similarity filtering (can be toggled on or off; see the similarity sketch after this list)
9. Dynamic downloading (JavaScript is loaded automatically), static downloading, or mixed downloading with a configurable dynamic/static ratio
10. Data storage in SQLite, with three modes: small data volume, large data volume, or no storage (see --storage below)
11. Built-in dictionary of seed URLs
12. Crawl strategies: breadth-first, depth-first, and random-first (see the frontier sketch after this list)
13. Automatic proxy pool selection (not yet implemented)
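
MSpider's actual similarity algorithm is not published here; the following is a minimal sketch of one common approach, in which each URL is reduced to a pattern (digits in the path collapsed, query values dropped) and a URL is skipped if its pattern has already been seen. The names url_pattern and is_similar are illustrative, not MSpider's API.

    import re
    from urllib.parse import urlparse, parse_qsl

    seen_patterns = set()

    def url_pattern(url):
        # Collapse digits in the path and drop query values, so that
        # /article/123?id=7 and /article/456?id=9 map to one pattern.
        p = urlparse(url)
        path = re.sub(r'\d+', '{n}', p.path)
        keys = ','.join(sorted(k for k, _ in parse_qsl(p.query)))
        return (p.netloc, path, keys)

    def is_similar(url):
        # True if an equivalent URL pattern was already crawled.
        pattern = url_pattern(url)
        if pattern in seen_patterns:
            return True
        seen_patterns.add(pattern)
        return False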
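
How the three crawl strategies can share a single URL frontier is easiest to see in code. This is a minimal sketch, not MSpider's implementation: breadth-first pops the oldest URL (FIFO), depth-first the newest (LIFO), and random-first a random one. The Frontier class is hypothetical; the 0/1/2 constants mirror the --policy values documented below.

    import random
    from collections import deque

    BREADTH_FIRST, DEPTH_FIRST, RANDOM_FIRST = 0, 1, 2  # --policy values

    class Frontier:
        def __init__(self, policy=BREADTH_FIRST):
            self.policy = policy
            self.queue = deque()

        def push(self, url):
            self.queue.append(url)

        def pop(self):
            if self.policy == BREADTH_FIRST:
                return self.queue.popleft()   # FIFO -> breadth-first
            if self.policy == DEPTH_FIRST:
                return self.queue.pop()       # LIFO -> depth-first
            # Random-first: rotate a random element to the front, pop it,
            # then rotate back so the rest keep their order.
            i = random.randrange(len(self.queue))
            self.queue.rotate(-i)
            url = self.queue.popleft()
            self.queue.rotate(i)
            return url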

Bug reports, feature requests, and criticism

Contact: Manning (乌云/WooYun)
QQ: 408468023

Screenshot

Usage: 
       MMMM   MMMM                              MM                                         
     MMMMMMMMMMMMMMM                          MM MMM       MMMMMMM                         
    MM      M      MM                         M   MM       MM   MM                         
    M               M     MMMMMM  MMMMMMMM    MMMMMM   MMMMMM   MM   MMMMMMMM     MMMMMM   
    M    MM   MM    M   MMM   MM MM      MMM  M   MM  MM    M   MM  MM      MMM  MM    M   
    M    MM   MM    M   M     MMMM         M  M   MM M      M   MM MM   MM    M MM     M   
    M    MM   MM    M  MM    MMMM   MMMM   MM M   MMMM   MMMM   MMMM   MM     MMMM   MMM   
    M    MM   MM    M MM    MM  M   MMMM   MM M   MM M   MMMM   MM M   MMMMMMMMMMM   M     
    M    MM   MM    M M     MM MM   M     MM  M   MM MM        MM  MM      MM   MM   M     
    M    MM   MM    MMM  MMMM  MM   MM   MM   M   MM  MMM    MMM    MMM    MMM  MM   M     
    MMMMMMMMMMMMMMMMM MMMM     MM   MMMMM     MMMMMM    MMMMMM        MMMMMM    MMMMMM     
                               MM   MM                                                     
                                MMMMMM                                                     
                                                                              by Manning

Options:
  -h, --help            show this help message and exit
  -u URL, --url=URL     Starting URL (domain name)
  -t THREADS_NUM, --thread=THREADS_NUM
                        Number of threads
  --depth=DEPTH         Crawl depth
  --model=MODEL         Crawl mode: static 0, dynamic 1, mixed 2
  --policy=POLICY       Crawl strategy: breadth-first 0, depth-first 1,
                        random-first 2
  -k KEYWORD, --keyword=KEYWORD
                        Keyword(s) to focus on in the URL host
  --time=FETCH_TIME     Crawl time: defaults to 7 days
  --count=FETCH_COUNT   Crawl count: defaults to 100000000 pages
  --proxy               Enable proxy mode
  --ignore=IGNORE_KEYWORD
                        Filter keyword(s) in the URL host or path
  --focus=FOCUS_KEYWORD
                        Focus keyword(s) in the URL path
  --storage=STORAGE_MODEL
                        Storage mode: small data 0, large data 1, no
                        storage 3
  --similarity=SIMILARITY
                        Similarity check: on 0, off 1

About

Spider

License: GNU General Public License v2.0


Languages

Language: Python 100.0%