Soryu23 / weibo-crawler


go_crawler

Go_crawler is a crawler project built on the Golang Colly framework that crawls Weibo sites and extracts information. It scrapes web content with regular expressions and XPath selectors, maps keywords into vector space with a word-vector model, and clusters text content with the HDBSCAN algorithm.
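To illustrate the scraping approach, here is a minimal Colly sketch; the XPath selector, the regular expression, the `screen_name` JSON field, and the entry URL are illustrative assumptions, not the project's actual values.

```go
package main

import (
	"fmt"
	"log"
	"regexp"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.UserAgent("Mozilla/5.0"), // a stand-in UA; util/agent.go presumably manages the real ones
	)

	// An XPath selector picks post text out of HTML pages (the selector is a guess).
	c.OnXML("//div[@class='weibo-text']", func(e *colly.XMLElement) {
		fmt.Println("post:", e.Text)
	})

	// A regular expression extracts fields from raw responses.
	re := regexp.MustCompile(`"screen_name":"(.*?)"`)
	c.OnResponse(func(r *colly.Response) {
		for _, m := range re.FindAllStringSubmatch(string(r.Body), -1) {
			fmt.Println("user:", m[1])
		}
	})

	if err := c.Visit("https://m.weibo.cn/"); err != nil { // placeholder entry point
		log.Fatal(err)
	}
}
```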

Features

Comprehensive capture of user information
Multi-dimensional collection of Weibo content
Timed incremental acquisition (see the scheduling sketch after this list)
Keyword cluster analysis
Category hotspot sorting
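For the timed incremental acquisition, a minimal scheduling sketch, assuming the 30-minute interval mentioned in the API list for /task; the project's actual tasks package may schedule differently.

```go
package main

import (
	"log"
	"time"
)

// runCrawlerTask is a placeholder for the project's crawl-and-store task.
func runCrawlerTask() {
	log.Println("crawling new Weibo posts...")
}

func main() {
	// Fire the crawler task every 30 minutes, matching /task's auto-run.
	ticker := time.NewTicker(30 * time.Minute)
	defer ticker.Stop()

	runCrawlerTask() // run once at startup
	for range ticker.C {
		runCrawlerTask()
	}
}
```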

Go_crawler is built on the following tools:

| Name | Description |
| --- | --- |
| Go | An open-source programming language that makes it easy to build simple, reliable, and efficient software. |
| Python | An interpreted, high-level, general-purpose programming language. |
| Gin | A web framework based on Go, with flexible middleware, strong data binding, and outstanding performance. |
| Ginkgo | Builds on Go's testing package, allowing expressive Behavior-Driven Development ("BDD") style tests. |
| Colly | Lightning-fast and elegant scraping framework for Gophers. |
| Postgres | The world's most advanced open-source relational database. |
| Gorm | The fantastic ORM library for Golang, which aims to be developer friendly. |
| Redis | An open-source (BSD-licensed), in-memory data structure store, used as a database, cache, and message broker. |
| Docker | A tool designed to make it easier to create, deploy, and run applications by using containers. |
| Sklearn | Simple and efficient tools for predictive data analysis. |
| Gensim | The fastest library for training vector embeddings, in Python or otherwise. |
| HDBSCAN | A clustering algorithm that extends DBSCAN by converting it into a hierarchical clustering algorithm, then extracting a flat clustering based on the stability of clusters. |

Why not Python

Python has several mature crawler frameworks, such as Scrapy and PySpider, with excellent runtime mechanisms and powerful capabilities. But when a site's anti-crawler measures are strong, rewriting their middleware becomes very difficult, and they are not flexible enough to be embedded into a larger project system.

Project structure

```
.
├── application.yml
├── args
│   ├── args.go
│   └── cmd.go
├── conf
│   ├── conf_debug.go
│   ├── conf.go
│   └── conf_release.go
├── controller
│   ├── application.go
│   ├── blogger.go
│   ├── category.go
│   ├── error.go
│   ├── query.go
│   ├── tag.go
│   └── task.go
├── corpus
│   └── corpus.txt
├── db
│   └── db.go
├── go.mod
├── go.sum
├── jwt
│   └── jwt.go
├── main.go
├── Makefile
├── models
│   ├── base_model.go
│   ├── blog.go
│   ├── blogger.go
│   ├── category.go
│   ├── tag.go
│   └── user.go
├── python
│   ├── dict.txt
│   ├── keywords.txt
│   ├── keywords_demo.py
│   └── save_cookies.go
├── README.md
├── redis
│   └── redis.go
├── routers
│   └── router.go
├── tasks
│   ├── tags.go
│   └── tasks.go
├── test
└── util
    ├── agent.go
    ├── cookie.go
    ├── cookies.txt
    └── util.go
```
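The models directory hints at GORM models sharing a common base. The sketch below shows that pattern under assumed field names; it is not the project's actual schema.

```go
package models

import "time"

// BaseModel carries the common columns embedded by the other models,
// mirroring what base_model.go presumably provides.
type BaseModel struct {
	ID        uint `gorm:"primaryKey"`
	CreatedAt time.Time
	UpdatedAt time.Time
}

// Blog is a guessed shape for models/blog.go: one crawled Weibo post.
type Blog struct {
	BaseModel
	BloggerID uint   // foreign key to the Blogger who posted it
	Text      string // post content extracted by the crawler
}

// Blogger is a guessed shape for models/blogger.go: a tracked account.
type Blogger struct {
	BaseModel
	Name  string
	Blogs []Blog // has-many association resolved by GORM
}
```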

How to use

  1. Install the tools and dependencies mentioned above.
  2. Configure application.yml and establish the database and Redis connections.
  3. go run main.go -db create
  4. go run main.go -db migrate
  5. go run main.go (steps 3–5 dispatch on the -db flag; see the sketch after this list)
  6. Add bloggers and keywords: POST /add_bloggers, /tags/set_keywords and /tags/cache_keywords.
  7. Wait 30 minutes or call /task (local debug environment).
  8. Let the bullets fly.
  9. POST /query_blogs to view the data.
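Steps 3–5 all invoke main.go with different -db values. Here is a minimal sketch of how such a flag dispatch could look; the project's actual args/cmd.go may differ.

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// -db selects a one-off database action; empty means "start the server".
	db := flag.String("db", "", "database action: create or migrate")
	flag.Parse()

	switch *db {
	case "create":
		fmt.Println("creating database and tables...")
	case "migrate":
		fmt.Println("migrating schema...")
	case "":
		fmt.Println("starting crawler and HTTP server...")
	default:
		fmt.Println("unknown -db action:", *db)
	}
}
```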

If you want to run clustering: POST /tags/keywords, download the corpus, run python keywords_demo.py, adjust the results, and POST /category/set. GET /category/query then shows the hot topics.
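As a rough sketch of how keywords might be cached in Redis and then dumped to a text file for the Python clustering step, using go-redis; the key name, sample keywords, and file path are assumptions:

```go
package main

import (
	"context"
	"os"
	"strings"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// /tags/cache_keywords could add keywords to a set ("keywords" is a guessed key).
	if err := rdb.SAdd(ctx, "keywords", "finance", "sports", "tech").Err(); err != nil {
		panic(err)
	}

	// /tags/keywords could read them back and write python/keywords.txt
	// for keywords_demo.py to cluster.
	words, err := rdb.SMembers(ctx, "keywords").Result()
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("python/keywords.txt", []byte(strings.Join(words, "\n")), 0o644); err != nil {
		panic(err)
	}
}
```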

API list

| API | Call | Router | Function |
| --- | --- | --- | --- |
| Ping | GET | /ping | ping |
| Task | GET | /task | crawler task (auto-runs every 30 minutes) |
| Query_blogs | POST | /query_blogs | query according to different parameters |
| Add_bloggers | POST | /add_bloggers | add bloggers to the task list |
| Set_category | GET | /category/set | set categories from the clustering result |
| Set_category_name | POST | /category/set_name | rename a category |
| Query_category | GET | /category/query | query categories |
| Query_tags | GET | /tags/query | query tags |
| Cache_keywords | POST | /tags/cache_keywords | save keywords to Redis |
| Get_keywords | POST | /tags/keywords | query keywords and write them to a txt file for clustering |
| Set_keywords | POST | /tags/set_keywords | add keywords as tags |
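A hypothetical routers/router.go sketch wiring the routes above with Gin; the stub handlers and port are stand-ins, not the project's actual controller functions:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// stub returns a placeholder handler so the sketch runs on its own;
// the project would register its real controller functions instead.
func stub(name string) gin.HandlerFunc {
	return func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"handler": name})
	}
}

func main() {
	r := gin.Default()

	r.GET("/ping", func(c *gin.Context) { c.String(http.StatusOK, "pong") })
	r.GET("/task", stub("task"))
	r.POST("/query_blogs", stub("query_blogs"))
	r.POST("/add_bloggers", stub("add_bloggers"))

	category := r.Group("/category")
	{
		category.GET("/set", stub("set_category"))
		category.POST("/set_name", stub("set_category_name"))
		category.GET("/query", stub("query_category"))
	}

	tags := r.Group("/tags")
	{
		tags.GET("/query", stub("query_tags"))
		tags.POST("/cache_keywords", stub("cache_keywords"))
		tags.POST("/keywords", stub("get_keywords"))
		tags.POST("/set_keywords", stub("set_keywords"))
	}

	r.Run(":8080") // assumed port
}
```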

License

MIT License