Simple CLI utility to download web pages or entire sites to a local folder based on a YAML configuration.
See the example in `config.example.yaml`. Structure:
```yaml
http:
  timeout_sec: 20        # HTTP client timeout, seconds
  user_agent: "go_crawler/0.1 (+https://example.local)"
  max_retries: 2         # number of retries on errors
  retry_backoff_ms: 300  # backoff between retries (increasing per attempt)
  # Optional number of concurrent workers; <=0 means auto (computed from CPU)
  workers: 0
  # Optional list of proxies to rotate between (http, https, socks5, socks5h)
  # Example:
  # - "http://127.0.0.1:8080"
  # - "http://user:pass@127.0.0.1:8080"
  # - "socks5://127.0.0.1:1080"
  # - "socks5://user:pass@127.0.0.1:1080"
  proxies: []

# Set of crawl jobs. Each job can be enabled/disabled, has a type and a list of URLs.
items:
  # 1) Download a single page (type: page)
  - enabled: true
    type: page           # page | pages | site
    urls:
      - "https://example.com/"
    output_dir: "out/page_example"
    include_assets: true
    asset_types: ["css", "js", "img"]  # css, js, img, font, media, other
    same_host_only: true
    max_depth: 0         # not used for page
    max_pages: 0         # not used for page

  # 2) Crawl multiple pages by following links on the same host (type: pages)
  - enabled: false
    type: pages
    urls:
      - "https://example.com/"
    output_dir: "out/pages_example"
    include_assets: true
    asset_types: ["css", "js", "img", "font"]
    same_host_only: true
    max_depth: 2         # crawl depth
    max_pages: 50        # limit number of pages

  # 3) Save a site (similar to pages, typically with same_host_only enabled)
  - enabled: false
    type: site
    urls:
      - "https://example.com/"
    output_dir: "out/site_example"
    include_assets: true
    asset_types: ["css", "js", "img", "font", "media", "other"]
    same_host_only: true
    max_depth: 3
    max_pages: 200
```

Field reference:
- http
  - timeout_sec: integer seconds (default 20).
  - user_agent: string sent as the User-Agent header.
  - max_retries: non-negative integer (default 0).
  - retry_backoff_ms: positive integer milliseconds (default 250); the backoff increases with each attempt.
  - workers: number of concurrent workers; <=0 means auto, computed from the CPU count. Default formula: min(max(4, NumCPU*4), 64). See the sketch after this list.
  - proxies: list of proxy URLs to rotate between; supports http, https, socks5, socks5h. Empty or omitted means a direct connection.
- items[] (job)
  - enabled: bool to toggle the job.
  - type: "page" | "pages" | "site".
  - urls: list of absolute URLs to start from.
  - output_dir: target directory for saved files.
  - include_assets: download assets and rewrite their references in the HTML.
  - asset_types: subset of ["css","js","img","font","media","other"]; "other" covers unrecognized assets (e.g., .ico, manifest.json via link). Used only if include_assets is true. Default ["css","js","img"].
  - same_host_only: restrict the crawl to hosts derived from the seed URLs (recommended for "site").
  - max_depth: integer depth limit for BFS (0 means unlimited).
  - max_pages: integer total page limit (0 means unlimited).
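
For illustration, here is a minimal Go sketch of these two settings: the worker auto-formula and a retry loop whose backoff grows linearly per attempt. The function names (`defaultWorkers`, `getWithRetries`) and the choice to retry on network errors and 5xx responses are assumptions made for this sketch, not the tool's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"runtime"
	"time"
)

// defaultWorkers mirrors the documented auto formula used when
// workers <= 0: min(max(4, NumCPU*4), 64).
func defaultWorkers(configured int) int {
	if configured > 0 {
		return configured
	}
	n := runtime.NumCPU() * 4
	if n < 4 {
		n = 4
	}
	if n > 64 {
		n = 64
	}
	return n
}

// getWithRetries illustrates the documented retry semantics: up to
// maxRetries extra attempts, sleeping attempt*backoff between them
// (backoff, 2*backoff, ...). Retrying on network errors and 5xx
// responses is an assumption of this sketch.
func getWithRetries(client *http.Client, rawURL string, maxRetries int, backoff time.Duration) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if attempt > 0 {
			time.Sleep(time.Duration(attempt) * backoff)
		}
		resp, err := client.Get(rawURL)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = errors.New(resp.Status)
		}
	}
	return nil, lastErr
}

func main() {
	fmt.Println("workers:", defaultWorkers(0)) // e.g. 32 on an 8-core machine

	client := &http.Client{Timeout: 20 * time.Second}
	resp, err := getWithRetries(client, "https://example.com/", 2, 300*time.Millisecond)
	if err != nil {
		fmt.Println("failed after retries:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```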

Output layout:
- Output files are stored under `<output_dir>/<host>/<path...>`.
- Directory-like URLs (ending with `/`) become `index.html` when downloading HTML.
- URLs without an extension become `<name>.html` for HTML pages.
- Non-HTML assets keep their extension; if a query string is present, an 8-char SHA1 of the query is appended before the extension: `name-xxxxxxxx.ext` (see the sketch after the examples).

Examples:
- `https://example.com/` → `out/.../example.com/index.html`
- `https://example.com/about` → `out/.../example.com/about.html`
- `https://static.example.com/main.css?v=123` → `out/.../static.example.com/main-<hash>.css`
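
The mapping above can be sketched in a few lines of Go. `localPath` and its `isHTML` flag are names made up for this illustration; the real tool may detect HTML pages differently (for example via the response Content-Type).

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"net/url"
	"path"
	"path/filepath"
	"strings"
)

// localPath maps a fetched URL to a file path under outputDir following the
// rules above: directory-like URLs become index.html, extensionless HTML
// pages get ".html", and assets with a query string get an 8-char SHA1
// suffix inserted before the extension.
func localPath(outputDir, rawURL string, isHTML bool) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	p := u.Path
	switch {
	case p == "" || strings.HasSuffix(p, "/"):
		p = path.Join(p, "index.html")
	case isHTML && path.Ext(p) == "":
		p += ".html"
	case !isHTML && u.RawQuery != "":
		sum := sha1.Sum([]byte(u.RawQuery))
		hash := hex.EncodeToString(sum[:])[:8]
		ext := path.Ext(p)
		p = strings.TrimSuffix(p, ext) + "-" + hash + ext
	}
	return filepath.Join(outputDir, u.Host, filepath.FromSlash(p)), nil
}

func main() {
	for _, raw := range []string{
		"https://example.com/",
		"https://example.com/about",
		"https://static.example.com/main.css?v=123",
	} {
		isHTML := !strings.HasSuffix(strings.Split(raw, "?")[0], ".css") // crude check, demo only
		p, _ := localPath("out/site_example", raw, isHTML)
		fmt.Println(raw, "->", p)
	}
}
```

Given two such paths, rewriting an asset reference inside a saved page then reduces to something like `filepath.Rel(filepath.Dir(htmlPath), assetPath)`.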

Crawling behavior:
- BFS starting from the provided URLs (for the "pages" / "site" types), as sketched below.
- Only http/https links are considered; URL fragments are ignored for uniqueness.
- Asset references are rewritten inside the HTML:
  - Attributes: href (link), src (script/img/source/video/audio), srcset (img/source).
  - References are rewritten to paths relative to the saved HTML location.
  - Data URLs are ignored.
- Host restriction via same_host_only.
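
A minimal sketch of this traversal, assuming a placeholder `extractLinks` callback in place of the real fetcher and HTML parser; the names and structure are illustrative only, with 0 meaning "unlimited" for both limits as documented.

```go
package main

import (
	"fmt"
	"net/url"
)

type queueItem struct {
	rawURL string
	depth  int
}

// crawl walks pages breadth-first from the seed URLs, honoring maxDepth
// (0 = unlimited), maxPages (0 = unlimited) and sameHostOnly. extractLinks
// stands in for fetching a page and pulling links out of its HTML.
func crawl(seeds []string, maxDepth, maxPages int, sameHostOnly bool,
	extractLinks func(string) []string) []string {

	allowedHosts := map[string]bool{}
	for _, s := range seeds {
		if u, err := url.Parse(s); err == nil {
			allowedHosts[u.Host] = true
		}
	}

	seen := map[string]bool{}
	var visited []string
	queue := make([]queueItem, 0, len(seeds))
	for _, s := range seeds {
		queue = append(queue, queueItem{s, 0})
	}

	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]

		u, err := url.Parse(cur.rawURL)
		if err != nil || (u.Scheme != "http" && u.Scheme != "https") {
			continue // only http/https links are considered
		}
		u.Fragment = "" // fragments are ignored for uniqueness
		key := u.String()

		if seen[key] || (sameHostOnly && !allowedHosts[u.Host]) {
			continue
		}
		if maxPages > 0 && len(visited) >= maxPages {
			break
		}
		seen[key] = true
		visited = append(visited, key)

		if maxDepth > 0 && cur.depth >= maxDepth {
			continue // depth limit reached; do not enqueue children
		}
		for _, link := range extractLinks(key) {
			queue = append(queue, queueItem{link, cur.depth + 1})
		}
	}
	return visited
}

func main() {
	links := map[string][]string{
		"https://example.com/": {"https://example.com/about", "https://other.example/"},
	}
	pages := crawl([]string{"https://example.com/"}, 2, 50, true,
		func(page string) []string { return links[page] })
	fmt.Println(pages) // [https://example.com/ https://example.com/about]
}
```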

License: MIT