Soryu23 / weibo-crawler


go_crawler

Go_crawler is a crawler project built on the Golang Colly framework that crawls Weibo sites and extracts information. It scrapes web content with regular expressions and XPath selectors, maps keywords into vector space with a word-vector model, and clusters text content with the HDBSCAN algorithm.
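To illustrate the scraping approach, here is a minimal Colly sketch; the XPath selector, the regular expression, the `screen_name` JSON field, and the entry URL are illustrative assumptions, not the project's actual values.

```go
package main

import (
	"fmt"
	"log"
	"regexp"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.UserAgent("Mozilla/5.0"), // a stand-in UA; util/agent.go presumably manages the real ones
	)

	// An XPath selector picks post text out of HTML pages (the selector is a guess).
	c.OnXML("//div[@class='weibo-text']", func(e *colly.XMLElement) {
		fmt.Println("post:", e.Text)
	})

	// A regular expression extracts fields from raw responses.
	re := regexp.MustCompile(`"screen_name":"(.*?)"`)
	c.OnResponse(func(r *colly.Response) {
		for _, m := range re.FindAllStringSubmatch(string(r.Body), -1) {
			fmt.Println("user:", m[1])
		}
	})

	if err := c.Visit("https://m.weibo.cn/"); err != nil { // placeholder entry point
		log.Fatal(err)
	}
}
```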

Features

Comprehensive capture of user information
Multi-dimensional collection of Weibo content
Timed incremental acquisition (see the scheduling sketch after this list)
Keyword cluster analysis
Category hotspot sorting
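For the timed incremental acquisition, a minimal scheduling sketch, assuming the 30-minute interval mentioned in the API list for /task; the project's actual tasks package may schedule differently.

```go
package main

import (
	"log"
	"time"
)

// runCrawlerTask is a placeholder for the project's crawl-and-store task.
func runCrawlerTask() {
	log.Println("crawling new Weibo posts...")
}

func main() {
	// Fire the crawler task every 30 minutes, matching /task's auto-run.
	ticker := time.NewTicker(30 * time.Minute)
	defer ticker.Stop()

	runCrawlerTask() // run once at startup
	for range ticker.C {
		runCrawlerTask()
	}
}
```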

Go_crawler is built on the following tools:

| Name | Description |
| --- | --- |
| Go | An open-source programming language that makes it easy to build simple, reliable, and efficient software. |
| Python | An interpreted, high-level, general-purpose programming language. |
| Gin | A web framework based on Go, with flexible middleware, strong data binding, and outstanding performance. |
| Ginkgo | Builds on Go's testing package, allowing expressive Behavior-Driven Development ("BDD") style tests. |
| Colly | Lightning-fast and elegant scraping framework for Gophers. |
| Postgres | The world's most advanced open-source relational database. |
| Gorm | The fantastic ORM library for Golang, which aims to be developer friendly. |
| Redis | An open-source (BSD-licensed), in-memory data structure store, used as a database, cache, and message broker. |
| Docker | A tool designed to make it easier to create, deploy, and run applications by using containers. |
| Sklearn | Simple and efficient tools for predictive data analysis. |
| Gensim | The fastest library for training vector embeddings, in Python or otherwise. |
| HDBSCAN | A clustering algorithm that extends DBSCAN by converting it into a hierarchical clustering algorithm, then extracting a flat clustering based on the stability of clusters. |

Why not Python

Python has several mature crawler frameworks, such as Scrapy and PySpider, with excellent runtime mechanisms and powerful capabilities. But when a site's anti-crawler measures are strong, rewriting their middleware becomes very difficult, and they are not flexible enough to be embedded into a larger project system.

Project structure

```
.
├── application.yml
├── args
│   ├── args.go
│   └── cmd.go
├── conf
│   ├── conf_debug.go
│   ├── conf.go
│   └── conf_release.go
├── controller
│   ├── application.go
│   ├── blogger.go
│   ├── category.go
│   ├── error.go
│   ├── query.go
│   ├── tag.go
│   └── task.go
├── corpus
│   └── corpus.txt
├── db
│   └── db.go
├── go.mod
├── go.sum
├── jwt
│   └── jwt.go
├── main.go
├── Makefile
├── models
│   ├── base_model.go
│   ├── blog.go
│   ├── blogger.go
│   ├── category.go
│   ├── tag.go
│   └── user.go
├── python
│   ├── dict.txt
│   ├── keywords.txt
│   ├── keywords_demo.py
│   └── save_cookies.go
├── README.md
├── redis
│   └── redis.go
├── routers
│   └── router.go
├── tasks
│   ├── tags.go
│   └── tasks.go
├── test
└── util
    ├── agent.go
    ├── cookie.go
    ├── cookies.txt
    └── util.go
```
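The models directory hints at GORM models sharing a common base. The sketch below shows that pattern under assumed field names; it is not the project's actual schema.

```go
package models

import "time"

// BaseModel carries the common columns embedded by the other models,
// mirroring what base_model.go presumably provides.
type BaseModel struct {
	ID        uint `gorm:"primaryKey"`
	CreatedAt time.Time
	UpdatedAt time.Time
}

// Blog is a guessed shape for models/blog.go: one crawled Weibo post.
type Blog struct {
	BaseModel
	BloggerID uint   // foreign key to the Blogger who posted it
	Text      string // post content extracted by the crawler
}

// Blogger is a guessed shape for models/blogger.go: a tracked account.
type Blogger struct {
	BaseModel
	Name  string
	Blogs []Blog // has-many association resolved by GORM
}
```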

How to use

  1. Install the tools and dependencies mentioned above.
  2. Configure application.yml and establish the database and Redis connections.
  3. go run main.go -db create
  4. go run main.go -db migrate
  5. go run main.go (steps 3–5 dispatch on the -db flag; see the sketch after this list)
  6. Add bloggers and keywords: POST /add_bloggers, /tags/set_keywords and /tags/cache_keywords.
  7. Wait 30 minutes or call /task (local debug environment).
  8. Let the bullets fly.
  9. POST /query_blogs to view the data.
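Steps 3–5 all invoke main.go with different -db values. Here is a minimal sketch of how such a flag dispatch could look; the project's actual args/cmd.go may differ.

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// -db selects a one-off database action; empty means "start the server".
	db := flag.String("db", "", "database action: create or migrate")
	flag.Parse()

	switch *db {
	case "create":
		fmt.Println("creating database and tables...")
	case "migrate":
		fmt.Println("migrating schema...")
	case "":
		fmt.Println("starting crawler and HTTP server...")
	default:
		fmt.Println("unknown -db action:", *db)
	}
}
```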

If you want to run clustering: POST /tags/keywords, download the corpus, run python keywords_demo.py, adjust the results, and POST /category/set. GET /category/query then shows the hot topics.
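As a rough sketch of how keywords might be cached in Redis and then dumped to a text file for the Python clustering step, using go-redis; the key name, sample keywords, and file path are assumptions:

```go
package main

import (
	"context"
	"os"
	"strings"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// /tags/cache_keywords could add keywords to a set ("keywords" is a guessed key).
	if err := rdb.SAdd(ctx, "keywords", "finance", "sports", "tech").Err(); err != nil {
		panic(err)
	}

	// /tags/keywords could read them back and write python/keywords.txt
	// for keywords_demo.py to cluster.
	words, err := rdb.SMembers(ctx, "keywords").Result()
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("python/keywords.txt", []byte(strings.Join(words, "\n")), 0o644); err != nil {
		panic(err)
	}
}
```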

API list

| API | Call | Router | Function |
| --- | --- | --- | --- |
| Ping | GET | /ping | ping |
| Task | GET | /task | crawler task (auto-runs every 30 minutes) |
| Query_blogs | POST | /query_blogs | query according to different parameters |
| Add_bloggers | POST | /add_bloggers | add bloggers to the task list |
| Set_category | GET | /category/set | set categories from the clustering result |
| Set_category_name | POST | /category/set_name | rename a category |
| Query_category | GET | /category/query | query categories |
| Query_tags | GET | /tags/query | query tags |
| Cache_keywords | POST | /tags/cache_keywords | save keywords to Redis |
| Get_keywords | POST | /tags/keywords | query keywords and write them to a txt file for clustering |
| Set_keywords | POST | /tags/set_keywords | add keywords as tags |
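A hypothetical routers/router.go sketch wiring the routes above with Gin; the stub handlers and port are stand-ins, not the project's actual controller functions:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// stub returns a placeholder handler so the sketch runs on its own;
// the project would register its real controller functions instead.
func stub(name string) gin.HandlerFunc {
	return func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"handler": name})
	}
}

func main() {
	r := gin.Default()

	r.GET("/ping", func(c *gin.Context) { c.String(http.StatusOK, "pong") })
	r.GET("/task", stub("task"))
	r.POST("/query_blogs", stub("query_blogs"))
	r.POST("/add_bloggers", stub("add_bloggers"))

	category := r.Group("/category")
	{
		category.GET("/set", stub("set_category"))
		category.POST("/set_name", stub("set_category_name"))
		category.GET("/query", stub("query_category"))
	}

	tags := r.Group("/tags")
	{
		tags.GET("/query", stub("query_tags"))
		tags.POST("/cache_keywords", stub("cache_keywords"))
		tags.POST("/keywords", stub("get_keywords"))
		tags.POST("/set_keywords", stub("set_keywords"))
	}

	r.Run(":8080") // assumed port
}
```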

License

MIT License