YAML files for including and excluding content
See the browsertrix-crawler project for more general documentation.
The example below crawls all subdomains (such as abc.collection.litme.com.ua) and handles http:// to https:// redirects:
collection: "collection-litme-com-ua"
workers: 16
saveState: always
# A few examples that exclude parts of a page from being fetched and recorded. Feel free to add more
blockRules:
- url: google-analytics.com
- url: googletagmanager.com
- url: yandex.ru
- url: mycounter.ua
- url: facebook.(com|net)
#- url: youtube.com/embed/ # Uncomment this line to skip the recording of embedded YouTube videos
seeds:
- url: http://collection.litme.com.ua/
scopeType: "domain"
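The "domain" scope roughly corresponds to accepting any http(s) URL whose host is the seed's domain or one of its subdomains. The regex below is only an illustration of that behaviour, not browsertrix-crawler's actual implementation:

```python
import re

# Hypothetical sketch of what scopeType: "domain" accepts for this seed:
# http or https, on collection.litme.com.ua or any of its subdomains.
in_scope = re.compile(r"^https?://([^/]+\.)?collection\.litme\.com\.ua/")

urls = [
    "http://collection.litme.com.ua/",              # seed itself
    "https://abc.collection.litme.com.ua/item/1",   # subdomain, https
    "https://example.com/collection.litme.com.ua",  # different host
]
for url in urls:
    print(url, "->", bool(in_scope.match(url)))
```

Only the third URL falls outside the scope, since the domain appears in its path rather than its host.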
If you notice that a crawl is collecting duplicate links that differ only in a query parameter, like
- '{"url":"https://nmiu.org/index.php/novyny-museum?iccaldate=2022-4-1","seedId":0,"depth":2}'
- '{"url":"https://nmiu.org/index.php/novyny-museum?iccaldate=2023-03-1","seedId":0,"depth":2}'
- '{"url":"https://nmiu.org/index.php/novyny-museum?iccaldate=2022-2-1","seedId":0,"depth":2}'
where each URL loads the same base page in the browser, first check that excluding the parameter really does lead to the same page. In this case, https://nmiu.org/index.php/novyny-museum loads the same page, so we can exclude every URL that uses iccaldate= as a query parameter by adding the following to our YAML file:
exclude:
- .*iccaldate=.*
This prevents the same page from taking up workers, crawl time, and space in the final file. The whole file then reads
collection: "collection-nmiu-org"
workers: 16
saveState: always
seeds:
- url: https://nmiu.org
include: .*nmiu\.org.*
scopeType: "host"
exclude:
- .*iccaldate=.*
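Before re-running the crawl, it can help to sanity-check an exclusion pattern against the offending URLs. The exclusion rules are ordinary regular expressions, so a quick Python check looks like this:

```python
import re

# The exclusion pattern from the YAML file above.
exclude = re.compile(r".*iccaldate=.*")

urls = [
    "https://nmiu.org/index.php/novyny-museum?iccaldate=2022-4-1",
    "https://nmiu.org/index.php/novyny-museum?iccaldate=2023-03-1",
    "https://nmiu.org/index.php/novyny-museum",  # base page, should be kept
]
for url in urls:
    status = "excluded" if exclude.match(url) else "crawled"
    print(f"{status}: {url}")
```

The two calendar-parameter variants are excluded while the base page is still crawled.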
Some websites contain recursive links, which can look like this:
'{"url":"https://nmiu.org/lektoriy/181-ekskursiji-kontent/vysta/3/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/exponat-tyzhnya","seedId":0,"depth":48}'
If you notice this kind of never-ending pattern in a site's URLs, avoid getting stuck in the trap by adding the depth option to the YAML file:
collection: "collection-litme-com-ua"
workers: 16
saveState: always
seeds:
- url: http://collection.litme.com.ua/
include:
- .*collection\.litme\.com\.ua.*
scopeType: "host"
depth: 25
This tells the crawler to follow links only up to 25 clicks away from the seed, so use it with care!
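The depth cap can be pictured as a breadth-first traversal that stops expanding links past a given number of hops. The sketch below is a simplified illustration of that idea, not browsertrix-crawler's code; `get_links` is a hypothetical link extractor:

```python
from collections import deque

def crawl(seed, get_links, max_depth):
    """Breadth-first traversal that stops following links past max_depth."""
    seen, queue, order = {seed}, deque([(seed, 0)]), []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # page is fetched, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# A self-referencing site: every page links to itself plus "/index.php".
get_links = lambda url: [url + "/index.php"]
pages = crawl("https://example.org", get_links, max_depth=3)
print(len(pages))  # seed plus 3 levels of links: 4 pages, then it stops
```

Without the depth cap, this loop would queue ever-longer `/index.php` URLs forever, which is exactly the trap the depth option avoids.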
In the recursive example shown above, exclude can also be used with
exclude:
- .*index\.php/index\.php.*
since this repeated path segment is what was causing the recursion.
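This exclusion can be checked the same way as before: the pattern only fires once the path segment repeats, so normal pages containing a single index.php are still crawled. A quick sketch:

```python
import re

# Match only URLs where the index.php segment repeats.
exclude = re.compile(r".*index\.php/index\.php.*")

recursive = ("https://nmiu.org/lektoriy/181-ekskursiji-kontent/vysta/3/"
             "index.php/index.php/index.php/exponat-tyzhnya")
normal = "https://nmiu.org/index.php/novyny-museum"

print("recursive excluded:", bool(exclude.match(recursive)))
print("normal excluded:   ", bool(exclude.match(normal)))
```

Either approach works here; the depth cap is a blunt safety net for any site, while the exclude rule targets this site's specific recursion pattern.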