Hydra: multithreaded site-crawling link checker in Python
A Python program that crawls (or rather, slithers 🐍) through a website for links and prints a YAML report of broken links.
Requires
Python 3.6 or higher.
There are no external dependencies, Neo.
Docker
If you don't have Python installed, you can use a Docker image instead:
docker pull ghcr.io/tohn/hydra-link-checker
Usage
$ python hydra.py -h
usage: hydra.py [-h] [--config CONFIG] URL
With Docker:
$ docker run --rm -it ghcr.io/tohn/hydra-link-checker python hydra.py -h
usage: hydra.py [-h] [--config CONFIG] URL
Positional arguments:
URL: The URL of the website to crawl. Ensure the URL is absolute, including the scheme, e.g. https://example.com.
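As a rough illustration of what "absolute, including the scheme" means, the following sketch checks a URL with the standard library before passing it to Hydra. The helper name `is_absolute_url` is hypothetical and not part of Hydra, and restricting the scheme to `http`/`https` is an assumption made for this example.

```python
from urllib.parse import urlsplit


def is_absolute_url(url: str) -> bool:
    """Return True if url has an http(s) scheme and a host part.

    Hypothetical helper for illustration; Hydra's own validation may differ.
    """
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_absolute_url("https://example.com"))  # absolute: scheme and host present
print(is_absolute_url("example.com"))          # no scheme, so not absolute
```

A URL such as `example.com` or `/about` fails this check because `urlsplit` finds no scheme or host in it.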
Optional arguments:
-h, --help: Show help message and exit
--config CONFIG, -c CONFIG: Path to a configuration file
A report of broken links is written to stdout in YAML format. To save it to a file, redirect the output:
python hydra.py [URL] > [PATH/TO/FILE.yaml]
You can add the current date to the filename using a command substitution, such as:
python hydra.py [URL] > /path/to/$(date '+%Y_%m_%d')_report.yaml
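On systems without a POSIX shell, the same dated filename can be built in Python. This is only a sketch: `report_path` and `save_report` are hypothetical helper names, and it assumes `hydra.py` is in the current working directory.

```python
import subprocess
import sys
from datetime import date
from pathlib import Path


def report_path(directory: str, day: date) -> Path:
    """Build a path such as <directory>/2024_01_31_report.yaml."""
    return Path(directory) / f"{day.strftime('%Y_%m_%d')}_report.yaml"


def save_report(url: str, directory: str) -> Path:
    """Run hydra.py on url and write its stdout to a dated report file."""
    target = report_path(directory, date.today())
    with target.open("w") as fh:
        # Assumes hydra.py is in the current working directory.
        subprocess.run([sys.executable, "hydra.py", url], stdout=fh)
    return target
```

Calling `save_report("https://example.com", "/path/to")` would then behave much like the shell redirect above.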
To see how long Hydra takes to check your site, prefix the command with time:
time python hydra.py [URL]
GitHub Action
You can easily incorporate Hydra as part of an automated process using the link-snitch action.
Configuration
Hydra can accept an optional JSON configuration file for specific parameters, for example:
{
    "OK": [
        200,
        999,
        403
    ],
    "attrs": [
        "href"
    ],
    "exclude_scheme_prefixes": [
        "tel"
    ],
    "tags": [
        "a",
        "img"
    ],
    "threads": 25,
    "timeout": 30
}
To use a configuration file, supply the filename:
python hydra.py https://example.com --config ./hydra-config.json
With Docker:
docker run --rm -v $(pwd):/opt -it ghcr.io/tohn/hydra-link-checker python hydra.py https://example.com --config ./hydra-config.json
Possible settings:
OK - HTTP response codes to consider as a successful link check. Defaults to [200, 999].
attrs - Attributes of the HTML tags to check for links. Defaults to ["href", "src"].
exclude_scheme_prefixes - HTTP scheme prefixes to exclude from checking. Defaults to ["tel:", "javascript:"].
tags - HTML tags to check for links. Defaults to ["a", "link", "img", "script"].
threads - Maximum workers to run. Defaults to 50.
timeout - Maximum seconds to wait for HTTP response. Defaults to 60.
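To make the relationship between the defaults and a configuration file concrete, here is a minimal sketch of how the two might be combined. It is not Hydra's actual implementation: the names `DEFAULTS`, `merge_config` and `load_config` are hypothetical, and it assumes that a value given in the file replaces the default outright rather than extending it.

```python
import json

# Default values as listed above; the dictionary name is illustrative only.
DEFAULTS = {
    "OK": [200, 999],
    "attrs": ["href", "src"],
    "exclude_scheme_prefixes": ["tel:", "javascript:"],
    "tags": ["a", "link", "img", "script"],
    "threads": 50,
    "timeout": 60,
}


def merge_config(overrides: dict) -> dict:
    """Return the defaults with any supplied settings replacing them.

    Assumption: a supplied list replaces the default list wholesale.
    """
    return {**DEFAULTS, **overrides}


def load_config(path: str) -> dict:
    """Read a JSON configuration file and merge it over the defaults."""
    with open(path) as fh:
        return merge_config(json.load(fh))


settings = merge_config({"threads": 25, "timeout": 30})
print(settings["threads"], settings["tags"])
```

Under that assumption, a file containing only `threads` and `timeout` would leave `OK`, `attrs`, `exclude_scheme_prefixes` and `tags` at their defaults.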
Test
Run:
python -m unittest tests/test.py