tohn / hydra-link-checker

Hydra: a multithreaded site-crawling link checker in Python standard library

Hydra: multithreaded site-crawling link checker in Python

A Python program that crawls (or rather, slithers 🐍 through) a website for links and prints a YAML report of broken links.

Requires

Python 3.6 or higher.

There are no external dependencies, Neo.

Docker

If you don't have Python installed, you can use the Docker image instead:

docker pull ghcr.io/tohn/hydra-link-checker

Usage

$ python hydra.py -h
usage: hydra.py [-h] [--config CONFIG] URL

With Docker:

$ docker run --rm -it ghcr.io/tohn/hydra-link-checker python hydra.py -h
usage: hydra.py [-h] [--config CONFIG] URL

Positional arguments:

  • URL: The URL of the website to crawl. It must be absolute and include the scheme, e.g. https://example.com.
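
As a rough illustration of what "absolute, including the scheme" means, the following sketch checks a URL with the standard library before passing it to Hydra. The helper name `is_absolute_url` is hypothetical, and restricting the check to http and https is an assumption; Hydra's own validation may differ.

```python
from urllib.parse import urlsplit


def is_absolute_url(url: str) -> bool:
    """Return True if url has an http(s) scheme and a host part.

    Hypothetical helper for illustration; not part of Hydra.
    """
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_absolute_url("https://example.com"))  # True
print(is_absolute_url("example.com"))          # False: no scheme
print(is_absolute_url("/about"))               # False: relative path
```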

Optional arguments:

  • -h, --help: Show help message and exit
  • --config CONFIG, -c CONFIG: Path to a configuration file

The broken-links report is written to stdout in YAML format. To save it to a file, redirect the output:

python hydra.py [URL] > [PATH/TO/FILE.yaml]

You can add the current date to the filename using a command substitution, such as:

python hydra.py [URL] > /path/to/$(date '+%Y_%m_%d')_report.yaml
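
If you would rather drive this from Python than from the shell, a minimal sketch of the same idea follows. The function names `dated_report_path` and `save_report` are hypothetical, and the sketch assumes `hydra.py` is in the current working directory. Hydra's exit status is not documented here, so the sketch does not check it.

```python
import subprocess
import sys
from datetime import date
from pathlib import Path


def dated_report_path(directory: str, today: date) -> Path:
    """Build a report path such as /path/to/2024_03_05_report.yaml."""
    return Path(directory) / f"{today:%Y_%m_%d}_report.yaml"


def save_report(url: str, directory: str = ".") -> Path:
    """Run hydra.py on url and write its stdout to a dated YAML file."""
    out_path = dated_report_path(directory, date.today())
    with open(out_path, "w", encoding="utf-8") as report:
        # Assumption: hydra.py lives in the current directory.
        # check=False because the exit status is not relied on here.
        subprocess.run(
            [sys.executable, "hydra.py", url], stdout=report, check=False
        )
    return out_path
```

For example, `save_report("https://example.com", "/path/to")` would write the report to a file named after today's date in `/path/to`.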

To see how long Hydra takes to check your site, prefix the command with time:

time python hydra.py [URL]

GitHub Action

You can easily incorporate Hydra as part of an automated process using the link-snitch action.

Configuration

Hydra can accept an optional JSON configuration file for specific parameters, for example:

{
    "OK": [
        200,
        999,
        403
    ],
    "attrs": [
        "href"
    ],
    "exclude_scheme_prefixes": [
        "tel"
    ],
    "tags": [
        "a",
        "img"
    ],
    "threads": 25,
    "timeout": 30
}

To use a configuration file, supply the filename:

python hydra.py https://example.com --config ./hydra-config.json

With Docker:

docker run --rm -v $(pwd):/opt -it ghcr.io/tohn/hydra-link-checker python hydra.py https://example.com --config ./hydra-config.json

Possible settings:

  • OK - HTTP response codes to consider as a successful link check. Defaults to [200, 999].
  • attrs - Attributes of the HTML tags to check for links. Defaults to ["href", "src"].
  • exclude_scheme_prefixes - URL scheme prefixes to exclude from checking. Defaults to ["tel:", "javascript:"].
  • tags - HTML tags to check for links. Defaults to ["a", "link", "img", "script"].
  • threads - Maximum number of worker threads to run. Defaults to 50.
  • timeout - Maximum seconds to wait for HTTP response. Defaults to 60.
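
To make the relationship between a configuration file and the defaults concrete, here is a minimal sketch of one way such settings could be merged. It is not Hydra's actual implementation: the names `DEFAULTS`, `merge_config`, `load_config`, and `is_excluded` are hypothetical, and rejecting unknown keys is an assumption made for the example. Only the default values themselves come from the list above.

```python
import json

# Default values as listed in the settings above.
DEFAULTS = {
    "OK": [200, 999],
    "attrs": ["href", "src"],
    "exclude_scheme_prefixes": ["tel:", "javascript:"],
    "tags": ["a", "link", "img", "script"],
    "threads": 50,
    "timeout": 60,
}


def merge_config(overrides: dict) -> dict:
    """Overlay user-supplied settings on the defaults.

    Assumption: unknown keys are rejected; Hydra may handle them differently.
    """
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown settings: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def load_config(path: str) -> dict:
    """Read a JSON configuration file and merge it with the defaults."""
    with open(path, encoding="utf-8") as f:
        return merge_config(json.load(f))


def is_excluded(url: str, config: dict) -> bool:
    """Return True if url starts with any excluded scheme prefix."""
    return url.startswith(tuple(config["exclude_scheme_prefixes"]))


config = merge_config(json.loads('{"threads": 25, "timeout": 30}'))
print(config["threads"], config["OK"])        # 25 [200, 999]
print(is_excluded("tel:+15551234", config))   # True
```

Settings that are absent from the file keep their default values, so a configuration file only needs to list what you want to change.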

Test

Run:

python -m unittest tests/test.py

License: MIT License