jq-wikidata
jq module to process Wikidata JSON format
This git repository contains a module for the jq data transformation language to process entity data from Wikidata or other Wikibase instances serialized in its JSON format.
Several methods exist to get entity data from Wikidata. This module is designed to process entities in their JSON serialization, especially large numbers of entities. For small sets of entities, please also consider using a dedicated client such as wikidata-cli instead.
Install
Installation requires jq version 1.5 or newer.
Put wikidata.jq in a place where jq can find it as a module.
One way to do so is to check out this repository to the directory ~/.jq/wikidata/:
mkdir -p ~/.jq && git clone https://github.com/nichtich/jq-wikidata.git ~/.jq/wikidata
Usage
The shortest way to use functions of this jq module is to include the module directly. Try processing a single Wikidata entity (see below for details about per-item access):
wget http://www.wikidata.org/wiki/Special:EntityData/Q42.json
jq 'include "wikidata"; .entities[].labels|reduceLabels' Q42.json
It is recommended to put Wikidata entities in a newline delimited JSON file:
jq -c '.entities[]' Q42.json > entities.ndjson
jq -c 'include "wikidata"; .labels|reduceLabels' entities.ndjson
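To see what the unwrapping step does without downloading anything, here is a minimal sketch on a hand-made file. The file names sample.json and sample.ndjson and the two entities are made up for illustration; they only mimic the entities wrapper of the Entity Data format.

```shell
# Hypothetical sample mimicking the "entities" wrapper (not real Wikidata content)
cat > sample.json <<'EOF'
{"entities": {
  "Q1": {"id": "Q1", "type": "item"},
  "Q2": {"id": "Q2", "type": "item"}
}}
EOF

# Write one compact JSON object per line
jq -c '.entities[]' sample.json > sample.ndjson

# Every line is now a complete entity that jq processes on its own
ids=$(jq -r '.id' sample.ndjson)
echo "$ids"
```

Because each line is an independent JSON document, such a file can also be split, concatenated, or filtered with line-based tools.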
More complex scripts are better put into a .jq file:
include "wikidata";
.labels|reduceLabels
The file can then be processed this way:
jq -f script.jq entities.ndjson
Process JSON dumps
Wikidata JSON dumps are made available at https://dumps.wikimedia.org/wikidatawiki/entities/. The current dumps exceed 35GB even in their most compressed form. Each dump file contains one large JSON array, so it is better converted into a stream of JSON objects for further processing.
With a fast and stable internet connection it is possible to process the dump on the fly like this:
curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 \
| bzcat | jq -nc --stream 'include "wikidata"; ndjson' | jq .id
Per-item access
JSON data for single entities can be obtained via the Entity Data URL. Examples:
- https://www.wikidata.org/wiki/Special:EntityData/Q42.json
- https://www.wikidata.org/wiki/Special:EntityData/L3006.json
- https://www.wikidata.org/wiki/Special:EntityData/L3006-F1.json
The module function entity_data_url creates these URLs from Wikidata identifier strings. The resulting data is wrapped in a JSON object; unwrap it with .entities|.[]:
curl $(echo Q42 | jq -rR 'include "wikidata"; entity_data_url') | jq '.entities|.[]'
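The URL pattern itself is simple enough to sketch with plain jq string interpolation. This is only an illustration based on the example URLs listed above, not the module's actual implementation of entity_data_url, and it makes no network request.

```shell
# Assumed URL pattern, taken from the example URLs in this section
url=$(echo Q42 | jq -rR '"https://www.wikidata.org/wiki/Special:EntityData/\(.).json"')
echo "$url"
```

The -R option makes jq read the identifier as a raw string instead of JSON, and -r prints the result without JSON quotes so it can be passed to curl or wget.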
As mentioned above, it is better to use wikidata-cli for accessing small sets of items:
wd d Q42
To get sets of items that match given criteria, use either SPARQL or the MediaWiki API modules wbsearchentities and/or wbgetentities.
Reduce entity data
Use function reduceEntity or more specific functions (reduceInfo, reduceItem, reduceProperty, reduceLexeme) to reduce the JSON data structure without loss of essential information.
Further select only specific fields if needed:
jq '{id,labels}' entities.ndjson
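Field selection combines well with jq's select filter when a file mixes entity types. The following sketch uses a made-up file mixed.ndjson with two hand-written entities; it relies only on jq builtins and on the type field of the entity JSON.

```shell
# Hypothetical two-entity sample in newline-delimited form
printf '%s\n' \
  '{"id":"Q1","type":"item","labels":{},"sitelinks":{}}' \
  '{"id":"P1","type":"property","labels":{},"datatype":"string"}' > mixed.ndjson

# Keep only items, and of those only the id and labels fields
items=$(jq -c 'select(.type == "item") | {id, labels}' mixed.ndjson)
echo "$items"
```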
API
Reduce entity
Applies reduceInfo and one of reduceItem, reduceProperty, reduceLexeme.
reduceEntity
Reduce item
Simplifies labels, descriptions, aliases, claims, and sitelinks of an item.
reduceItem
Reduce property
Simplifies labels, descriptions, aliases, and claims of a property.
reduceProperty
Reduce labels
.labels|reduceLabels
Reduce descriptions
.descriptions|reduceDescriptions
Reduce aliases
.aliases|reduceAliases
Reduce sitelinks
.sitelinks|reduceSitelinks
Reduce lexeme
Simplifies lemmas, forms, and senses of a lexeme entity.
reduceLexeme
Reduce forms
.forms|reduceForms
Reduce senses
.senses|reduceSenses
Reduce claims
Removes unnecessary fields .id, .hash, .type, .property and simplifies values for each claim.
.claims|reduceClaims
Reduce claim
Reduces a single claim value.
.claims.P26[]|reduceClaim
Reduce references
...
Reduce forms
Only lexemes have forms.
.forms|reduceForms
Reduce info
reduceInfo
Removes additional information fields pageid, ns, title, lastrevid, and modified.
To remove selected fields only, see the jq function del.
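As a small sketch of that approach, the jq builtin del can drop just some of the info fields while keeping the rest. The entity below is a made-up stand-in, not real Wikidata content.

```shell
# Hypothetical entity carrying two of the info fields listed above
entity='{"id":"Q1","pageid":123,"title":"Q1","labels":{}}'

# del takes one or more path expressions and removes them from the input
slim=$(printf '%s\n' "$entity" | jq -c 'del(.pageid, .title)')
echo "$slim"
```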
Stream an array of entities
Module function ndjson can be used to process a stream with an array of entities into a list of entities:
bzcat latest-all.json.bz2 | jq -n --stream 'include "wikidata"; ndjson'
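How the module implements ndjson is not shown here. As a rough sketch of the general technique, jq's own streaming builtins can split a top-level array into its elements without parsing the whole array first. The tiny inline array stands in for a dump.

```shell
# Tiny stand-in for a dump: one top-level array of entities
dump='[{"id":"Q1"},{"id":"Q2"}]'

# --stream emits [path, value] events. Dropping the first path component
# (the array index) and reassembling the events yields one object per element.
entities=$(printf '%s\n' "$dump" | jq -cn --stream 'fromstream(1 | truncate_stream(inputs))')
echo "$entities"
```

The -n option matters here: without it jq would consume the first event as its regular input, and inputs would miss it.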
An alternative, possibly more performant way to process an array of entities uses standard text tools:
bzcat latest-all.json.bz2 | head -n-1 | tail -n+2 | sed 's/,$//'
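This line-based pipeline assumes that the dump has [ on its first line, ] on its last line, and one entity per line in between, each followed by a comma. Negative line counts for head may not be supported by every implementation, so the same trimming can be sketched with sed alone. The inline sample is made up and only mimics that layout.

```shell
# Made-up sample with the assumed layout: "[" first, "]" last, one entity per line
lines=$(printf '[\n{"id":"Q1"},\n{"id":"Q2"}\n]\n' \
  | sed -e '1d' -e '$d' -e 's/,$//')   # drop first and last line, strip trailing commas
echo "$lines"
```

Unlike the jq streaming approach, this never parses the JSON, so it breaks if the layout assumption does not hold.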
Contributing
The source code is hosted at https://github.com/nichtich/jq-wikidata.
Bug reports and feature requests are welcome!
License
Made available under the MIT License by Jakob Voß.