matthewrudy / gocrawler

A Golang web crawler implemented for a coding test

gocrawler

Implemented features:

  • Static HTML parser
  • Retries
  • Parallel crawling

Planned features:

  • Maximum Depth
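
One possible shape for a maximum-depth limit is a breadth-first traversal that records each page's depth and stops following links once the limit is reached. The sketch below is an assumption about how this could work, not the planned implementation: `linksFunc` and `crawlToDepth` are hypothetical names, and link extraction is passed in as a function.

```go
package main

import "fmt"

// linksFunc is a hypothetical stand-in that returns the links found on a page.
type linksFunc func(url string) []string

// crawlToDepth visits pages breadth-first starting at root. Links on
// pages that are already maxDepth hops from root are not followed.
// It returns the depth at which each visited URL was first seen.
func crawlToDepth(root string, maxDepth int, links linksFunc) map[string]int {
	depth := map[string]int{root: 0}
	queue := []string{root}
	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		if depth[u] >= maxDepth {
			continue // at the limit: keep the page, ignore its links
		}
		for _, next := range links(u) {
			if _, seen := depth[next]; !seen {
				depth[next] = depth[u] + 1
				queue = append(queue, next)
			}
		}
	}
	return depth
}

func main() {
	// A tiny in-memory site used instead of real HTTP requests.
	site := map[string][]string{
		"/":      {"/about", "/rss"},
		"/about": {"/", "/team"},
		"/team":  {"/team/alice"},
	}
	links := func(u string) []string { return site[u] }

	visited := crawlToDepth("/", 1, links)
	_, sawTeam := visited["/team"]
	fmt.Println("pages visited:", len(visited))
	fmt.Println("visited /team:", sawTeam)
}
```

Because the depth map doubles as the visited set, each URL is queued at most once, which also prevents loops between pages that link to each other.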

Missing features:

  • Respect robots.txt
  • Look for Sitemap in robots.txt
  • Retries with backoff
  • Render JavaScript (headless Chrome?)
  • Extract assets added by CSS
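
For the Sitemap item, robots.txt files conventionally list sitemaps on lines of the form `Sitemap: <url>`, with the field name matched case-insensitively. A minimal sketch of pulling those URLs out of an already-downloaded robots.txt body could look like the following; the function name `sitemaps` and the example.com URLs are hypothetical, and the sketch does not handle the `Allow`/`Disallow` rules themselves.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sitemaps returns the URLs from every "Sitemap:" line in the body of
// a robots.txt file. The field name is matched case-insensitively.
func sitemaps(robots string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Split on the first colon only, so the "://" in the URL survives.
		i := strings.Index(line, ":")
		if i < 0 {
			continue
		}
		key := strings.TrimSpace(line[:i])
		value := strings.TrimSpace(line[i+1:])
		if strings.EqualFold(key, "sitemap") && value != "" {
			out = append(out, value)
		}
	}
	return out
}

func main() {
	robots := "User-agent: *\n" +
		"Disallow: /private\n" +
		"Sitemap: http://example.com/sitemap.xml\n"
	for _, s := range sitemaps(robots) {
		fmt.Println("sitemap:", s)
	}
}
```

Comment lines such as `# Sitemap: ...` are skipped without special handling, because the text before the first colon is then `# Sitemap` rather than `Sitemap`.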

Demo:

# install
$ go get -u github.com/matthewrudy/gocrawler/...

# crawl http://tomblomfield.com
$ gocrawler
success: http://tomblomfield.com/
success: http://tomblomfield.com/about
success: http://tomblomfield.com/rss
success: http://tomblomfield.com/day/2015/12/13
...
http://tomblomfield.com/random
 - http://www.gravatar.com/avatar/c833be5582482777b51b8fc73e8b0586?s=128&d=identicon&r=PG
 - http://78.media.tumblr.com/ddebec46b60f554989f09682fc3d8e71/tumblr_inline_mtj697fPI11r5tr1m.jpg

http://tomblomfield.com/rss

Languages

HTML 90.7%, Go 9.3%, Makefile 0.0%