ubalklen / Broken-Link-Finder

Python script to find broken links on a site

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Broken Link Finder

A Python script that scans a web page of a given URL and validates the links on it.

If the page has a link to the same hostname as the URL given by the user, its destination page also scanned. It works like a web crawler, so potentially the whole site will be scanned. Be aware that this may take hours.

All broken links found are saved on broken-links-[date]-[time]-[random-ID].txt.

Requisites

Chrome 75 (put the proper driver in drivers folder if you have another version)

Python 3 (it has been tested on Python 3.7)

Some additional Python modules (check the script)

Details

The script scans pages using Selenium to account for links that may be injected via JavaScript.

Then, each found link is validate with Requests in a concurrent (multi-thread) fashion.

If you don't care about rendering JavaScript, the find_broken_links_req.py script doesn't use Selenium and thus is slightly faster.

A single-thread script is provided for benchmarking.

TODO

  • Better exception handling
  • Validate links in other tags besides <href a=...> (like <img src=...>)
  • Timeout
  • Link depth limit
  • Proxy support
  • NTLM support
  • Selenium driver parallelization

About

Python script to find broken links on a site

License:MIT License


Languages

Language:Python 100.0%