go-crawler-distributed


GitHub: https://github.com/golang-collection/go-crawler-distributed

A distributed crawler project. It supports secondary development of customized page parsers, uses a microservice architecture overall, and sends messages asynchronously through message queues. Frameworks used include redigo, gorm, goquery, easyjson, viper, amqp, zap, and go-micro. Deployment is containerized with Docker, and the intermediate crawler nodes support horizontal scaling.

Directory Structure

- configs: configuration files
- dependencies: docker-compose files for the environments the project depends on
- deploy: scripts
    - buildScript: build and deployment scripts
    - deploy: scripts that start the project directly
    - dockerBuildScript: scripts that build the Docker images
    - service: Dockerfiles for the services
    - crawlOperation: the unified crawler processing module (everything except storage)
- global: global configuration
- init: project initialization
    - usage: add a leading _ to the import (blank import)
- internal: internal packages
    - crawler: crawler-related logic
        - crawerConfig: message queue configuration for each crawler
        - douban: Douban page parsing
        - fetcher: web page fetching
        - meituan: Meituan page parsing
        - persistence: structs used to persist data
        - worker: the concrete work logic; this is what decouples the code (see the sketch after this list)
    - dao: database access
    - model: data models
- pkg: common packages
    - cache: Redis-based caching
    - consistentHash: consistent hashing
    - db: database operations
    - elastic: Elasticsearch operations
    - email: email utilities
    - errcode: global unified error codes
    - file: file utilities
    - idGenerator: global ID generation via snowflake
    - logger: logging
    - mq: message queue
    - setting: configuration file handling
    - tracer: distributed tracing
    - upload: file upload
    - util: miscellaneous utilities
- service: microservices
    - cache: Redis microservice, accessed via gRPC
      - proto: gRPC proto definitions
      - server: gRPC server implementation
    - douban: Douban-related services
      - crawl_tags: crawls the Douban [tags](https://book.douban.com/tag/) page, the crawler's starting point
      - crawl_list: crawls Douban [list](https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4) pages
      - crawl_detail: crawls Douban book [detail](https://book.douban.com/subject/25955474/) pages
      - storage_detail: stores Douban book details in MySQL
    - elastic: Elasticsearch microservice, accessed via gRPC
      - proto: gRPC proto definitions
      - server: gRPC server implementation
    - meituan: Meituan-related services
      - crawl_tags: crawls the Meituan tech blog [list](https://tech.meituan.com/), the crawler's starting point
      - crawl_list: crawls Meituan [urllist](https://tech.meituan.com/page/2.html) pages
      - crawl_detail: crawls Meituan tech article [detail](https://tech.meituan.com/2020/04/23/octo-watt.html) pages
      - storage_detail: stores Meituan tech article details in Elasticsearch
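
The crawl stages under service (tags → list → detail → storage) pass work to each other asynchronously through RabbitMQ, which is what allows each stage to scale independently. The sketch below shows how a worker stage might consume URLs from a queue and hand them on for fetching and parsing; the queue name, connection URL, and log-only "parse" step are illustrative assumptions, not the project's actual worker code.

// Minimal sketch of a crawl-stage worker consuming URLs from RabbitMQ.
// The queue name "douban_list" and the connection URL are assumptions for
// illustration; the real worker logic lives in internal/crawler/worker.
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@127.0.0.1:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Declare the queue this stage reads from (durable, not auto-deleted).
	q, err := ch.QueueDeclare("douban_list", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Consume URLs published by the previous stage.
	msgs, err := ch.Consume(q.Name, "", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for msg := range msgs {
		url := string(msg.Body)
		log.Printf("would fetch and parse %s here", url)
	}
}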

Configuration

You need to customize your configuration. Create a config.json file under the configs folder in the project root directory, or add the configuration to Consul.

A sample config.json:

{
  "mysql": {
    "user": "root",
    "password": "example",
    "host": "127.0.0.1:3306",
    "db_name": "short_story"
  },
  "redis": {
    "host": "127.0.0.1:6379"
  },
  "rabbitmq": {
    "user": "guest",
    "password": "guest",
    "host": "127.0.0.1:5672"
  },
  "elastic": {
    "url": "http://127.0.0.1:9200",
    "index": "article"
  }
}
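
For a local deployment, the file above is read with viper. Below is a minimal sketch of loading configs/config.json and unmarshalling one section; the MySQLConfig struct is a hypothetical stand-in for the project's own config types, with keys following the sample above.

// Minimal sketch of loading configs/config.json with viper; MySQLConfig is a
// hypothetical struct, and the keys mirror the sample config above.
package main

import (
	"log"

	"github.com/spf13/viper"
)

type MySQLConfig struct {
	User     string `mapstructure:"user"`
	Password string `mapstructure:"password"`
	Host     string `mapstructure:"host"`
	DBName   string `mapstructure:"db_name"`
}

func main() {
	viper.SetConfigName("config") // config.json
	viper.SetConfigType("json")
	viper.AddConfigPath("./configs")

	if err := viper.ReadInConfig(); err != nil {
		log.Fatalf("read config: %v", err)
	}

	var mysql MySQLConfig
	if err := viper.UnmarshalKey("mysql", &mysql); err != nil {
		log.Fatalf("unmarshal mysql config: %v", err)
	}
	log.Printf("MySQL at %s, user %s", mysql.Host, mysql.User)
}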

Viper can also read the configuration remotely from Consul:

config

func init() {
	err := viper.AddRemoteProvider("consul", config.ConsulURL, config.ConsulConfigPath)
	if err != nil {
		logger.Error("read config", zap.Error(err))
		return
	}
	viper.SetConfigType("json") // need to explicitly set this to json
	if err := viper.ReadRemoteConfig(); err != nil {
		logger.Error("read config", zap.Error(err))
		return
	}
}

More detail: watchConfig
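
The watchConfig linked above presumably keeps the running services in sync with the configuration stored in Consul. A minimal sketch of that polling pattern with viper's remote support follows; the 5-second interval and the Config struct are illustrative assumptions.

// Minimal sketch of polling Consul for configuration changes with viper.
// The interval and the Config struct are illustrative assumptions.
package config

import (
	"log"
	"time"

	"github.com/spf13/viper"
	_ "github.com/spf13/viper/remote" // enables remote providers such as consul
)

type Config struct {
	Redis struct {
		Host string `mapstructure:"host"`
	} `mapstructure:"redis"`
}

// watchRemoteConfig periodically re-reads the remote config and unmarshals
// it over the current values.
func watchRemoteConfig(conf *Config) {
	for {
		time.Sleep(5 * time.Second)

		if err := viper.WatchRemoteConfig(); err != nil {
			log.Printf("watch remote config: %v", err)
			continue
		}
		if err := viper.Unmarshal(conf); err != nil {
			log.Printf("unmarshal config: %v", err)
		}
	}
}

It would typically be started with go watchRemoteConfig(&conf) right after the init shown above.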

Parser

The parsers use CSS selectors for page parsing.

douban

meituan
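
A minimal sketch of what such a parser looks like with goquery follows; the CSS selectors and the Book struct are illustrative, not the actual ones used in internal/crawler/douban.

// Minimal sketch of a CSS-selector parser in the style this project uses.
// The selectors and the Book struct are illustrative assumptions.
package parser

import "github.com/PuerkitoBio/goquery"

type Book struct {
	Title string
	URL   string
}

// ParseBookList extracts book titles and detail-page URLs from a list page.
func ParseBookList(doc *goquery.Document) []Book {
	var books []Book
	doc.Find("li.subject-item h2 a").Each(func(i int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		books = append(books, Book{Title: s.Text(), URL: href})
	})
	return books
}

The *goquery.Document would typically be built from the fetcher's response body with goquery.NewDocumentFromReader.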

Framework

framework

The microservices are registered in Consul.

service
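
Below is a minimal sketch of a service registering itself with Consul through go-micro; the import paths assume go-micro v2 with the go-plugins Consul registry, and the service name is hypothetical, so the project's actual versions and names may differ.

// Minimal sketch of a go-micro service registering itself in Consul.
// Import paths assume go-micro v2 + go-plugins; the service name is hypothetical.
package main

import (
	"log"

	"github.com/micro/go-micro/v2"
	"github.com/micro/go-micro/v2/registry"
	"github.com/micro/go-plugins/registry/consul/v2"
)

func main() {
	// Point the registry at the Consul agent started by docker-compose.
	reg := consul.NewRegistry(registry.Addrs("127.0.0.1:8500"))

	service := micro.NewService(
		micro.Name("go.micro.service.cache"), // hypothetical service name
		micro.Registry(reg),
	)
	service.Init()

	// The gRPC handler (e.g. the Redis cache server) would be registered here.

	if err := service.Run(); err != nil {
		log.Fatal(err)
	}
}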

Architecture

flow

Installation

There are two ways to deploy the project, locally or in the cloud:

  • Direct Deploy
  • Docker(Recommended)

Prerequisites

  • Go 1.13.6
  • Redis 6.0
  • MySQL 5.7
  • RabbitMQ (management image)
  • Elasticsearch
  • Consul
  • Others: see go.mod

Quick Start

Open a command prompt and execute the commands below. Make sure docker-compose is installed in advance.

git clone https://github.com/Knowledge-Precipitation-Tribe/go-crawler-distributed
cd go-crawler-distributed/deploy/buildScript
bash linux_build
cd ../
docker network create -d bridge crawler
docker-compose up -d

The above commands start the basic services: Consul, Redis, Elasticsearch, RabbitMQ, and MySQL. The gRPC services for Redis and Elasticsearch are also started.

docker-compose up -d
[GIN-debug] GET    /swagger/*any             --> github.com/swaggo/gin-swagger.CustomWrapHandler.func1 (7 handlers)
[GIN-debug] GET    /sd/health                --> go-crawler-distributed/handler/sd.HealthCheck (7 handlers)
[GIN-debug] GET    /sd/disk                  --> go-crawler-distributed/handler/sd.DiskCheck (7 handlers)
[GIN-debug] GET    /sd/cpu                   --> go-crawler-distributed/handler/sd.CPUCheck (7 handlers)
[GIN-debug] GET    /sd/ram                   --> go-crawler-distributed/handler/sd.RAMCheck (7 handlers)
[GIN-debug] POST   /crawler/douban           --> go-crawler-distributed/handler/crawler.StartDoubanCrawler (7 handlers)
[GIN-debug] POST   /crawler/meituan          --> go-crawler-distributed/handler/crawler.StartMeituanCrawler (7 handlers)
{"level":"info","time":"2020-09-04T11:29:14.954+0800","linenum":"go-crawler-distributed/main.go:59","msg":"Start to listening the incoming requests on http address:80"}
{"level":"info","time":"2020-09-04T11:29:14.956+0800","linenum":"go-crawler-distributed/main.go:55","msg":"The router has been deployed successfully."}

Swagger UI:

swagger

Ports

  • 6379: redis
  • 3306: mysql
  • 9200, 9300: elasticSearch

elasticSearch

  • 5672, 15672: rabbitMQ

rabbitmq

  • 8500: consul

consul

Run

Basic services

Use docker-compose for one-click startup. Create a file named docker-compose.yml with the content below.

version: "3"

services:

  cache:
    build:
      context: cache
      dockerfile: Dockerfile
    networks:
      - crawler

  elastic:
    build:
      context: elastic
      dockerfile: Dockerfile
    depends_on:
      - elastic_server
    networks:
      - crawler

  redis:
    image: redis
    restart: always
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    networks:
      - crawler

  mysql:
    image: mysql
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: example
    networks:
      - crawler

  elastic_server:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    ports:
      - "9200:9200"
      - "9300:9300"
    volumes:
       - elastic-data:/data
    environment:
      - discovery.type=single-node
    networks:
      - crawler

  rabbitmq:
    image: rabbitmq:management
    hostname: myrabbitmq
    ports:
      - "5672:5672"
      - "15672:15672"
    volumes:
      - rabbitmq-data:/var/lib/rabbitmq
    networks:
      - crawler

  consul1:
    image: consul
    restart: always
    ports:
      - "8500:8500"
      - "8300:8300"
      - "8301:8301"
      - "8302:8302"
      - "8600:8600"
    command: agent -server -bootstrap-expect 2 -ui -bind=0.0.0.0 -client=0.0.0.0
    networks:
      - crawler

  consul2:
    image: consul
    restart: always
    ports:
      - "8501:8500"
    command: agent -server -ui -bind=0.0.0.0 -client=0.0.0.0 -join consul1
    networks:
      - crawler

  consul3:
    image: consul
    restart: always
    ports:
      - "8502:8500"
    command: agent -server -ui -bind=0.0.0.0 -client=0.0.0.0 -join consul1
    networks:
      - crawler


volumes:
  elastic-data:
  rabbitmq-data:
  redis-data:

networks:
  crawler:
    external: true

Then execute the command below, and the basic services will start up.

docker-compose up -d

The above command starts the basic services: Redis, Elasticsearch, RabbitMQ, and MySQL. The gRPC services for Redis and Elasticsearch are also started.

Crawler service

Create a docker-compose.yml file under the meituan or douban folder.

version: "3"

services:

  crawl_list:
    build:
      context: crawl_list
      dockerfile: Dockerfile
    networks:
      - crawler

  crawl_tags:
    build:
      context: crawl_tags
      dockerfile: Dockerfile
    networks:
      - crawler

  crawl_detail:
    build:
      context: crawl_detail
      dockerfile: Dockerfile
    networks:
      - crawler

  storage_detail:
    build:
      context: storage_detail
      dockerfile: Dockerfile
    networks:
      - crawler

networks:
  crawler:
    external: true

Then execute the command below, and the crawler service will start up.

docker-compose up -d

If you want to start multiple crawler nodes, you can use the following command.

docker-compose up --scale crawl_list=3 -d

Direct

Start the services directly, but be aware that they still depend on Redis, RabbitMQ, and the other basic services.

cd deploy/deploy/
bash start-meituan-direct.sh

Note

This project wraps Redis and similar components as microservices via go-micro, but in my view a standalone Redis should not be wrapped as a microservice on its own; microservice boundaries should be drawn from a business perspective.

Appendix

License

MIT

Copyright (c) 2020 Knowledge-Precipitation-Tribe
