crawley

The unix-way web crawler: crawls web pages and prints any link it can find.

features

  • fast HTML SAX parser (powered by golang.org/x/net/html)
  • small (<1000 SLOC), idiomatic, 100% test-covered codebase
  • grabs most useful resource URLs (pictures, videos, audio, etc.)
  • found URLs are streamed to stdout and guaranteed to be unique, as in the example below
  • configurable scan depth (limited to the starting host and path; 0 by default)
  • can crawl robots.txt rules and sitemaps
  • brute mode: scans HTML comments for URLs (this can produce bogus results)
  • makes use of the HTTP_PROXY / HTTPS_PROXY environment variables
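
For instance, a minimal invocation (example.com stands in for a real target) that streams every discovered URL to stdout, ready to be redirected or piped:

    # crawl the target one level deep and save the unique URLs it finds
    crawley -depth 1 https://example.com > urls.txt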

installation
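
A typical way to install a Go CLI like this one is go install; the module and package path below are assumed from the repository name and may differ in this fork:

    # assumes the main package lives under cmd/crawley -- adjust the path if needed
    go install github.com/juxuanu/crawley/cmd/crawley@latest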

usage

crawley [flags] url

possible flags:

-brute
    scan html comments
-delay duration
    per-request delay (default 250ms)
-depth int
    scan depth, set to -1 for unlimited
-help
    show a description of these flags and their defaults
-robots string
    action for robots.txt: ignore/crawl/respect (default "ignore")
-silent
    suppress info and error messages on stderr
-skip-ssl
    skip SSL certificate verification
-user-agent string
    custom User-Agent string to use
-version
    show version
-workers int
    number of concurrent workers
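
A couple of illustrative invocations (example.com stands in for a real target):

    # crawl two levels deep, respect robots.txt rules, and slow requests down
    crawley -depth 2 -robots respect -delay 500ms https://example.com

    # also scan HTML comments, routing traffic through a local proxy
    HTTPS_PROXY=http://127.0.0.1:8080 crawley -brute https://example.com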

license

MIT License

