Track changes on websites via git
This tool checks all the websites listed in its config. When a change is detected, the new site is added to a git commit. It can then be inspected via normal git tooling.
Basically it's curl, sed++ and then git commit in a neat package.
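As a rough illustration of that idea (not how website-stalker is implemented), the manual version could be sketched as a small shell function. The name `save_and_commit` and the digit-stripping `sed` expression are assumptions made up for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of website-stalker.
# Reads content from stdin, strips all digit runs with sed (a stand-in for
# "remove the noisy parts"), stores the result as FILE and commits it only
# when the stored content actually changed.
save_and_commit() {
    file="$1"
    sed -E 's/[0-9]+//g' > "$file"
    git add -- "$file"
    # `git diff --cached --quiet` exits non-zero when something is staged.
    if ! git diff --cached --quiet -- "$file"; then
        git commit --quiet -m "update $file"
    fi
}

# Usage inside a git repository (needs network access):
#   curl -fsS https://edjopato.de/robots.txt | save_and_commit robots.txt
```

Running it twice on unchanged content creates no second commit, which is the behaviour that keeps the git history readable.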
See it in action (literally in GitHub Actions).
- GitHub Releases
- Arch Linux User Repository (AUR)
- Docker Hub Image
- Via Rust and Cargo: clone the repository, then run `cargo install --path .`
Check out website-stalker-example which runs within GitHub actions.
- First create a new folder / repo for tracking website changes:

      mkdir personal-stalker
      cd personal-stalker
      website-stalker init

  `website-stalker init` will create a git repo (`git init`) and the example config (`website-stalker example-config > website-stalker.yaml`) for you.

- Add your favorite website to the configuration file `website-stalker.yaml`. Also make sure to set the value of `from` to an email address of yours.

      website-stalker example-config > website-stalker.yaml
      nano website-stalker.yaml

- Check if your config is valid:

      website-stalker check

- Run your newly added website. If you added `https://apple.com/newsroom`, use something like this to test whether everything works the way you want:

      website-stalker run apple

- Set up a cronjob / systemd.timer executing the following command every now and then:

      website-stalker run --all --commit
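One possible way to wire up that last step is a small wrapper script that cron calls. This is only a sketch: the function name `run_stalker`, the script path and the repository location are assumptions for this example. Only the `website-stalker run --all --commit` command comes from the steps above.

```shell
#!/bin/sh
# Hypothetical wrapper for cron or a systemd timer.
# Takes the directory containing website-stalker.yaml as its only argument.
run_stalker() {
    repo_dir="$1"
    # Run in a subshell so the caller's working directory stays untouched.
    (
        cd "$repo_dir" || exit 1
        website-stalker run --all --commit
    )
}

# A script ending with `run_stalker "$HOME/personal-stalker"` could then be
# scheduled once per hour with a crontab entry such as:
#   0 * * * * /path/to/run-stalker.sh
```

The `cd` matters because the config is expected in the working directory where website-stalker runs.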
The config describes a list of sites. Each site has a URL. Additionally, each site can have editors which are used before saving the file. Each editor manipulates the content of the URL.
# This is an example config
# The filename should be `website-stalker.yaml`
# and it should be in the working directory where you run website-stalker.
#
# For example run `website-stalker example-config > website-stalker.yaml`.
# Adapt the config to your needs and set the FROM email address which is used as a request header:
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/From
#
# And then do a run via `website-stalker run --all`.
---
from: my-email-address
sites:
  - url: "https://edjopato.de/post/"
    editors:
      - css_select: article
      - css_remove: a
      - html_prettify
      - regex_replace:
          pattern: "(Lesezeit): \\d+ \\w+"
          replace: $1
  - url: "https://edjopato.de/robots.txt"
There is a bigger config in my example repo. The example repo is also used by me to detect changes of interesting sites.
Options which are globally configured at the root level of the configuration file `website-stalker.yaml`.
`from` is used as the From header in the web requests.
It is a required field.
The idea here is to provide a way for a website host to contact whoever is doing something to their web server. As this tool is self-hosted and can be run as often as the user likes this can annoy website hosts. While this tool is named "stalker" and is made to track websites it is not intended to annoy people.
This tool always sets the User-Agent header to `website-stalker/<version> https://github.com/EdJoPaTo/website-stalker` and the From header to the config value.
This way both the creator and the user of this tool can be reached in case of problems.
from: my-email-address
Alternatively you can specify FROM via environment variable
export WEBSITE_STALKER_FROM=my-email-address
When using notifications you might want your own style of notification instead of the default one. With `notification_template` you can specify your own template, which is handled via the Mustache syntax. The following example contains all currently available data points.
When writing your own template, use `website-stalker check` to ensure the template will work.
notification_template: |
  These {{siteamount}} sites changed:
  {{#sites}}
  - {{.}}
  {{/sites}}
  The following hosts are involved:
  {{#hosts}}
  - {{.}}
  {{/hosts}}
  {{#singlehost}}
  All changes happened on only one host: {{singlehost}}
  {{/singlehost}}
  {{^singlehost}}
  The changes happened on various hosts.
  {{/singlehost}}
  The {{commit}} contains all these changes.
Options available per site besides the editors which are explained below.
The `url` option takes one or multiple URLs. The simple form is a single URL:
sites:
  - url: "https://edjopato.de/"
  - url: "https://edjopato.de/post/"
It's also possible to specify multiple URLs at the same time. This is helpful when multiple sites share the same options (like editors).
sites:
  - url:
      - "https://edjopato.de/"
      - "https://edjopato.de/post/"
`accept_invalid_certs` allows HTTPS connections with self-signed or invalid / expired certificates.
From reqwest's documentation:
You should think very carefully before using this method. If invalid certificates are trusted, any certificate for any site will be trusted for use. This includes expired certificates. This introduces significant vulnerabilities, and should only be used as a last resort.
Do you have a need for self-signed certificates or the usage of the system certificate store? Please share about it in Issue #39.
sites:
  - url: "https://edjopato.de/post/"
    accept_invalid_certs: true
With `ignore_error`, a site that errors only produces a warning.
This is useful for buggy services which are sometimes just gone or for pages which will exist in the future but are not there yet. Personal example: A bad DNS configuration which lets the website appear nonexistent for some time.
This setting also skips errors from editors.
sites:
  - url: "https://edjopato.de/might-appear-in-the-future"
    ignore_error: true
`filename` overrides the URL-based default filename of the site.
Normally the filename is automatically derived from the URL.
For the following example it would be something like `de-edjopato-api-token-0123456789-action-hack-20the-20planet.html`.
With the `filename` option it is saved as `de-edjopato-api-planet-hack.html` instead.
sites:
  - url: "https://edjopato.de/api?token=0123456789&action=hack%20the%20planet"
    filename: de-edjopato-api-planet-hack
`headers` adds additional HTTP headers to the request to the given site.
This is useful for sites that respond differently based on different headers.
Each header key/value pair is supplied as a YAML string, with key and value separated by a `:` followed by a space.
This is the same syntax as HTTP uses, which sadly collides with YAML: YAML assumes something with a `:` is an object.
Therefore you have to make sure to quote the headers.
Using a YAML object (key/value pairs) is also not possible, as some header keys are allowed multiple times.
sites:
  - url: "https://edjopato.de/"
    headers:
      - "Cache-Control: no-cache"
      - "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:106.0) Gecko/20100101 Firefox/106.0"
Editors manipulate the content of a webpage to simplify comparing versions later on.
For example: if you are interested in the content of a webpage, the `<head>` with its changing stylesheets isn't interesting to you.
If you keep it, it will still create diffs which end up as commits.
This creates noise you're probably just going to ignore.
That's why editors exist.
Think of editors as a pipeline: each one gets the output of the one before as its input.
As some editors assume HTML input, they won't work (well) with non-HTML input.
For example, it's rather useless to use `html_prettify` after `html_textify`, as plain text won't end up being pretty HTML.
For this reason editors like `css_select` still produce valid HTML output.
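The pipeline idea is the same one shell pipes follow. The sketch below only illustrates the analogy: `strip_tags` and `drop_numbers` are hypothetical stand-ins written with `sed`, not the editors shipped with website-stalker.

```shell
#!/bin/sh
# Hypothetical stand-ins for editors: each reads stdin and writes stdout.
strip_tags() { sed -E 's/<[^>]+>//g'; }
drop_numbers() { sed -E 's/[0-9]+//g'; }

# The output of one stage is the input of the next, like chained editors.
pipeline_demo() {
    printf '<p>Lesezeit: 5 min</p>\n' | strip_tags | drop_numbers
}
```

Order matters here just as it does for editors: a stage that expects HTML has nothing useful to do once an earlier stage has already removed the tags.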
There are probably more tasks out there that might be useful as editors. Feel free to provide an issue for an editor idea or create a Pull Request with a new editor.
`css_remove` tries to remove every instance of matching HTML elements and returns the remaining HTML.
It is the opposite of `css_select`.
Examples:
editors:
  - css_remove: article
  - css_remove: h1 a
  - css_remove: h1 > a
`css_select` uses CSS selectors to grab every instance of matching HTML elements and returns all of them.
If no matching HTML elements are found, this editor errors.
Examples:
editors:
  - css_select: article
  - css_select: h1 a
  - css_select: h1 > a
`html_markdownify` formats the input HTML as Markdown.
This is rather simple right now. Please report issues you find.
Example:
editors:
  - html_markdownify
`html_prettify` formats the input HTML as pretty HTML.
Example:
editors:
  - html_prettify
`html_sanitize` strips HTML down to its minimal form.
Example:
editors:
  - html_sanitize
`html_textify` only returns the text content of HTML elements within the input.
Example:
editors:
  - html_textify
`html_url_canonicalize` parses the input HTML for URLs. URLs are parsed into their canonical, absolute form.
Example:
editors:
  - html_url_canonicalize
`json_prettify` formats the input JSON as pretty JSON.
Example:
editors:
  - json_prettify
`regex_replace` searches the input with a regex pattern and replaces all occurrences with the given replace phrase.
Grouping and replacing with `$1` also works.
Examples:
editors:
  # Remove all occurrences of that word
  - regex_replace:
      pattern: "tree"
      replace: ""
  # Remove all numbers
  - regex_replace:
      pattern: "\\d+"
      replace: ""
  # Find all css files and remove the extension
  - regex_replace:
      pattern: "(\\w+)\\.css"
      replace: $1
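To see what a capture group replacement does, the `(Lesezeit): \\d+ \\w+` pattern from the example config can be approximated with `sed`. This is only an analogy: the function name `keep_label` is made up for this example, and `sed` uses a different regex dialect (`\1` instead of `$1`, POSIX character classes instead of `\d` and `\w`).

```shell
#!/bin/sh
# Rough sed approximation of regex_replace with
#   pattern: "(Lesezeit): \\d+ \\w+"   replace: $1
# The first capture group is kept, the changing number and unit are dropped.
keep_label() {
    sed -E 's/(Lesezeit): [0-9]+ [[:alnum:]_]+/\1/g'
}

# Usage:
#   printf 'Lesezeit: 5 Minuten\n' | keep_label
```

Dropping the changing part while keeping the label means an updated reading time no longer shows up as a diff.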
`rss` creates an RSS 2.0 feed from the input.
An RSS item is generated for every `item_selector` result.
The other selectors can be used to find relevant information of the items.
The content is the full result of the `item_selector`.
It can be further edited with every available editor.
Defaults:
- `title`: When a `<title>` exists, it will be used. Otherwise, it's empty.
- `item_selector`: `article`
- `title_selector`: `h2`
- `link_selector`: `a`
- `content_editors` can be omitted when empty
Examples:
# Fully specified example
- url: "https://edjopato.de/post/"
  editors:
    - rss:
        title: EdJoPaTos Blog
        item_selector: article
        title_selector: h2
        link_selector: a
        content_editors:
          - css_remove: "h2, article > a, div"
          - html_textify
# Minimal working example
- url: "https://edjopato.de/post/"
  editors:
    - rss: {}
When changes on websites are detected they get saved to the filesystem.
When `--commit` is given, a git commit is created.
Additionally, you can get notified via Telegram, Slack, E-Mail, ... pling is used to send these notifications. Check its documentation about which environment variables to specify in order to get notifications.
Example with Telegram:
export TELEGRAM_BOT_TOKEN='123:ABC'
export TELEGRAM_TARGET_CHAT='1234'
website-stalker run --all
- Website Changed Bot is a Telegram Bot which might potentially use this tool later on
- bernaferrari/ChangeDetection is an Android app for this
- dgtlmoon/changedetection.io can be selfhosted and configured via web interface
- Feed me up, Scotty! creates RSS feeds from websites
- htmlq command line tool to format / select html (like jq for html)
- urlwatch