This serves a very specific purpose: checking whether links between pages within a mediawiki wiki will work as expected:
- checks whether internal mediawiki links with an #anchor match a header in the target page
- checks for duplicate section names (a rough sketch of both checks follows below)
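A rough sketch of the core idea, in terms of pywikibot (set up as described below). The regexes and the heading-to-anchor mapping here are simplifications for illustration, not exactly what the script or mediawiki does:

import re
from collections import Counter
import pywikibot

site = pywikibot.Site()   # configured via user-config.py, see below

LINK_RE = re.compile(r'\[\[([^|\]#]+)#([^|\]]+)')        # [[Target#Anchor]] or [[Target#Anchor|label]]
HEADING_RE = re.compile(r'^=+ *(.+?) *=+ *$', re.MULTILINE)

def heading_anchors(page):
    # Crude approximation of how mediawiki turns '== Some heading ==' into an anchor:
    # take the heading text and replace spaces with underscores.
    return {h.replace(' ', '_') for h in HEADING_RE.findall(page.text)}

def check_anchor_links(page):
    # Report [[Target#Anchor]] links whose anchor matches no heading on the target page.
    for target, anchor in LINK_RE.findall(page.text):
        target_page = pywikibot.Page(site, target)
        if anchor.strip().replace(' ', '_') not in heading_anchors(target_page):
            print(f'{page.title()}: [[{target}#{anchor}]] does not match a header')

def duplicate_sections(page):
    # Report section names that occur more than once on the same page.
    counts = Counter(HEADING_RE.findall(page.text))
    return [name for name, amount in counts.items() if amount > 1]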
First version, doesn't do everything I want yet.
Does not persist anything, which implies it is only reasonable to run on small wikis.
It checks my ~1000-page wiki in a minute, which is more than good enough for me. ...but you really DON'T want to run this against wikipedia. You'd want a lot more work to, say, not fetch six million pages every run.
- more link normalization, based on what mediawiki actually does for you (it's not well documented, bleh), because right now it reports some links as broken that do actually work (rough sketch of such normalization below)
- consider anchors in the wiki text to be targets too
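For reference, the kind of normalization meant is roughly the following; this is an assumption based on observed behaviour (underscores and spaces are interchangeable, repeated whitespace collapses, the first character is case-insensitive on a default setup), not mediawiki's actual, documented rules:

def normalize_title(title):
    # Assumed approximation of mediawiki title normalization, for comparison purposes only.
    title = ' '.join(title.replace('_', ' ').split())    # underscores -> spaces, collapse whitespace
    return title[:1].upper() + title[1:]                  # first letter is case-insensitive by default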
Depends on:
- pywikibot, which does most of the heavy lifting
- networkx, which makes it easier to select the links we can already check while we're still fetching
Both are covered by: pip3 install networkx pywikibot
Create a user-config.py in the same directory. Mine looks something like:
mylang = 'en'
family = 'placeholder'
usernames['placeholder']['en'] = 'TestingAnchorBot'
family_files['placeholder'] = 'https://wiki.example.org/api.php'
For some more explanation, see e.g. https://www.mediawiki.org/wiki/Manual:Pywikibot/user-config.py and https://www.mediawiki.org/wiki/Manual:Pywikibot/Use_on_third-party_wikis
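With that in place, pywikibot should find the wiki without further arguments. A quick sanity check, run from the same directory, is just:

import pywikibot
print(pywikibot.Site())   # should mention your family name and language code, something like placeholder:en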
- stop complaining about non-encoded anchors; mediawiki functions fine with those
- same for redirects, because both the name and the anchor can vary
- include checks of redirects, reported in a sensible way, including broken redirects
- add command line parameters
CONSIDER
- discover page names as we go (right now we hope that prefix search for [a-z0-9] gets most things); see the sketch after this list
- remove the threading; it's probably not worth the complexity. This is one case where a back-and-forth sort of concurrency is good enough
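One possible shape for that discovery, as a sketch (not what the script currently does): start from a few seed pages and queue any linked page we have not seen yet, so names are found as we go rather than via an up-front prefix search.

import collections
import pywikibot

site = pywikibot.Site()

def discover(seed_titles):
    # Breadth-first over internal links: every linked page we have not seen yet gets queued.
    seen = set(seed_titles)
    queue = collections.deque(seed_titles)
    while queue:
        page = pywikibot.Page(site, queue.popleft())
        if not page.exists():
            continue
        for linked in page.linkedPages():
            title = linked.title()
            if title not in seen:
                seen.add(title)
                queue.append(title)
    return seen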