crawley

The unix-way web crawler: crawls web pages and prints any link it can find.

features

  • fast HTML SAX parser (powered by golang.org/x/net/html)
  • small (<1000 SLOC), idiomatic, 100% test-covered codebase
  • grabs most useful resource URLs (pictures, videos, audio, etc.)
  • found URLs are streamed to stdout and guaranteed to be unique, as in the example below
  • configurable scan depth (limited to the starting host and path; 0 by default)
  • can crawl robots.txt rules and sitemaps
  • brute mode: scans HTML comments for URLs (this can produce bogus results)
  • makes use of the HTTP_PROXY / HTTPS_PROXY environment variables
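
For instance, a minimal invocation (example.com stands in for a real target) that streams every discovered URL to stdout, ready to be redirected or piped:

    # crawl the target one level deep and save the unique URLs it finds
    crawley -depth 1 https://example.com > urls.txt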

installation
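
A typical way to install a Go CLI like this one is go install; the module and package path below are assumed from the repository name and may differ in this fork:

    # assumes the main package lives under cmd/crawley -- adjust the path if needed
    go install github.com/juxuanu/crawley/cmd/crawley@latest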

usage

crawley [flags] url

possible flags:

-brute
    scan html comments
-delay duration
    per-request delay (default 250ms)
-depth int
    scan depth, set to -1 for unlimited
-help
    show a description of these flags and their defaults
-robots string
    action for robots.txt: ignore/crawl/respect (default "ignore")
-silent
    suppress info and error messages on stderr
-skip-ssl
    skip SSL certificate verification
-user-agent string
    custom User-Agent string to use
-version
    show version
-workers int
    number of concurrent workers
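
A couple of illustrative invocations (example.com stands in for a real target):

    # crawl two levels deep, respect robots.txt rules, and slow requests down
    crawley -depth 2 -robots respect -delay 500ms https://example.com

    # also scan HTML comments, routing traffic through a local proxy
    HTTPS_PROXY=http://127.0.0.1:8080 crawley -brute https://example.com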

license

MIT License

