A short script to extract the most linked domains in references of a given Wikimedia site
This isn't meant to be pretty or optimized; it was a quick-and-dirty means to an end. Standard disclaimer applies. This is mostly for my own reference, but it might be useful in case I get hit by a bus.
- Obtain an XML dump of the Wikimedia wiki you want to analyze. You want the `xxwiki*-2015xxxx-pages-articles*.xml*.bz2` file(s).
- Run the mwrefs script on it to extract the references. Example:
  `nice ./utility extract frwiki-20150512-pages-articles?.xml.bz2 | bzip2 -c > mwrefs-frwiki-20150512.bz2`
- Extract the archive to a `.tsv` file and run the script: `./refsdomains.sh`
- The results are in `sorted_domain_list.txt`.
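The domain-counting step can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual `refsdomains.sh`: it assumes URLs appear somewhere in each TSV line, matches them with a regex, strips the scheme and a leading `www.`, and counts occurrences per domain. The `count_domains` function name and the sample file are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of a refsdomains.sh-style pipeline: count the most-linked
# domains found in a TSV of extracted references.
count_domains() {
  grep -oE 'https?://[^/[:space:]"<>]+' "$1" \
    | sed -E 's#https?://##; s#^www\.##' \
    | sort | uniq -c | sort -rn
}

# Example: build a tiny sample TSV and count its domains.
printf 'p1\thttp://example.org/a\np2\thttps://www.example.org/b\np3\thttp://wikipedia.org/c\n' > sample.tsv
count_domains sample.tsv > sorted_domain_list.txt
cat sorted_domain_list.txt
```

`sort | uniq -c | sort -rn` is the classic frequency-count idiom: identical domains are grouped, counted, then re-sorted with the most common first, which is what a "most linked domains" list needs.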
This was done as part of a short research project looking into how well Citoid supports the references most commonly used on Wikimedia sites.