refsdomains

A short script to extract the most linked domains in references of a given Wikimedia site

This isn't meant to be pretty or optimized. It was a quick-and-dirty means to an end. Standard disclaimer applies.

Usage

This is mostly for my own reference, but it might be useful in case I get hit by a bus.

  • Obtain an XML dump of the Wikimedia wiki you want to analyze. You want the xxwiki*-2015xxxx-pages-articles*.xml*.bz2 file(s).
  • Run the mwrefs script on it to extract the references. For example:
    nice ./utility extract frwiki-20150512-pages-articles?.xml.bz2 | bzip2 -c > mwrefs-frwiki-20150512.bz2
  • Extract the archive to a .tsv file and run the script:
    ./refsdomains.sh
  • The results are in sorted_domain_list.txt. (A rough sketch of these last two steps follows below.)
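
For reference, here is a minimal sketch of roughly what the last two steps amount to, using only standard shell tools. This is an illustration under assumptions, not the actual implementation: refsdomains.sh may parse the .tsv differently, and the filenames below are simply the ones from the example above.

    # Decompress the mwrefs output into a .tsv file (filename taken from the example above)
    bzcat mwrefs-frwiki-20150512.bz2 > mwrefs-frwiki-20150512.tsv

    # Pull every URL out of the references, strip the scheme and a leading "www.",
    # then count occurrences of each domain, most-linked first
    grep -oE 'https?://[^/[:space:]">]+' mwrefs-frwiki-20150512.tsv \
      | sed -E 's#^https?://##; s#^www\.##' \
      | sort | uniq -c | sort -rn > sorted_domain_list.txt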

Background

This was done as part of a short research project looking into how well Citoid supported references commonly used on Wikimedia sites.

License

MIT License

