nichtich / jq-wikidata

jq module to process Wikidata JSON format

jq-wikidata

This git repository contains a module for the jq data transformation language to process entity data from Wikidata or other Wikibase instances serialized in its JSON format.

Several methods exist to get entity data from Wikidata. This module is designed to process entities in their JSON serialization, especially for large numbers of entities. For small sets of entities, consider using a dedicated client such as wikidata-cli instead.

Install

Installation requires jq version 1.5 or newer.

Put wikidata.jq in a place where jq can find it as a module. One way to do so is to check out this repository into the directory ~/.jq/wikidata/:

mkdir -p ~/.jq && git clone https://github.com/nichtich/jq-wikidata.git ~/.jq/wikidata
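
The layout above works because jq resolves include "name" to name.jq or name/name.jq inside each search directory. The sketch below demonstrates this with a throwaway module called demo (hypothetical, not part of this repository) and the -L option, so nothing in the home directory is touched.

```shell
# Create a temporary search directory with a hypothetical module "demo",
# laid out like ~/.jq/wikidata/wikidata.jq after the git clone above.
moddir=$(mktemp -d)
mkdir -p "$moddir/demo"
printf 'def hello: "module found";\n' > "$moddir/demo/demo.jq"

# -L adds a directory to the module search path; jq then looks for
# demo.jq and demo/demo.jq inside it.
check_module() {
  jq -rn -L "$moddir" 'include "demo"; hello'
}

check_module
```

A similar check against the real installation would be to run jq -n 'include "wikidata"; "ok"', which should report an error if the module cannot be found.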

Usage

The quickest way to use functions of this jq module is to include the module directly. For example, to process a single Wikidata entity (see below for details about per-item access):

wget http://www.wikidata.org/wiki/Special:EntityData/Q42.json
jq 'include "wikidata"; .entities[].labels|reduceLabels' Q42.json

It is recommended to put Wikidata entities in a newline-delimited JSON file:

jq -c '.entities[]' Q42.json > entities.ndjson
jq -c 'include "wikidata"; .labels|reduceLabels' entities.ndjson
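
To give an idea of what such a reduction does, the sketch below flattens the labels structure with plain jq and without the module. The sample record is shortened and made up for illustration, and map_values(.value) is only an assumption about the kind of result reduceLabels produces; the module's actual output may differ.

```shell
# Wikidata labels map each language code to an object
# that repeats the language and holds the label text.
sample='{"id":"Q183","labels":{"de":{"language":"de","value":"Deutschland"},"en":{"language":"en","value":"Germany"}}}'

# Keep only the label text per language (illustrative stand-in,
# not the module's implementation of reduceLabels).
flatten_labels() {
  jq -c '.labels | map_values(.value)'
}

printf '%s\n' "$sample" | flatten_labels
```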

More complex scripts are better put into a .jq file:

include "wikidata";

.labels|reduceLabels

The file can then be processed this way:

jq -f script.jq entities.ndjson

Process JSON dumps

Wikidata JSON dumps are made available at https://dumps.wikimedia.org/wikidatawiki/entities/. The current dumps exceed 35GB even in their most compressed form. Each dump file contains one large JSON array, so it is best converted into a stream of JSON objects for further processing.

With a fast and stable internet connection it's possible to process the dump on the fly like this:

curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 \
  | bzcat | jq -nc --stream 'include "wikidata"; ndjson' | jq .id

Per-item access

JSON data for single entities can be obtained via the Entity Data URL, for example http://www.wikidata.org/wiki/Special:EntityData/Q42.json.

The module function entity_data_url creates these URLs from Wikidata identifier strings. The resulting data is wrapped in a JSON object; unwrap it with .entities|.[]:

curl $(echo Q42 | jq -rR 'include "wikidata"; entity_data_url') | jq '.entities|.[]'
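
The unwrapping step can be tried offline. In this sketch the URL pattern is taken from the wget example in the Usage section and rebuilt with plain string interpolation; that entity_data_url yields exactly this string is an assumption. The wrapped response is a shortened, made-up sample.

```shell
# Build an Entity Data URL from an identifier read as a raw string (-R).
# Hypothetical stand-in for entity_data_url, for illustration only.
build_url() {
  jq -rR '"http://www.wikidata.org/wiki/Special:EntityData/\(.).json"'
}

# Entity Data responses wrap entities in an object keyed by identifier;
# .entities|.[] emits each wrapped entity on its own.
unwrap() {
  jq -c '.entities|.[]'
}

echo Q42 | build_url
echo '{"entities":{"Q42":{"id":"Q42","type":"item"}}}' | unwrap
```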

As mentioned above, wikidata-cli is the better choice for accessing small sets of items:

wd d Q42

To get sets of items that match given criteria, use either SPARQL or the MediaWiki API modules wbsearchentities and/or wbgetentities.
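
For the wbgetentities route, a request URL for several identifiers might be assembled as below. The action, ids and format parameters belong to the MediaWiki API; the helper name wbgetentities_url is made up for this sketch. The response wraps entities in .entities, so the unwrapping shown for Entity Data applies here as well.

```shell
# Percent-encode a "|"-separated identifier list for the query string.
# wbgetentities_url is a hypothetical helper, not part of the module.
wbgetentities_url() {
  jq -rn --arg ids "$1" \
    '"https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=\($ids | @uri)"'
}

wbgetentities_url 'Q42|Q64'
# To fetch and unwrap (needs network access):
#   curl -s "$(wbgetentities_url 'Q42|Q64')" | jq -c '.entities|.[]'
```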

Reduce entity data

Use function reduceEntity or more specific functions (reduceInfo, reduceItem, reduceProperty, reduceLexeme) to reduce the JSON data structure without loss of essential information.

Further select only some specific fields if needed:

jq '{id,labels}' entities.ndjson

API

Reduce Entity

Applies reduceInfo and one of reduceItem, reduceProperty, reduceLexeme.

reduceEntity

Reduce item

Simplifies labels, descriptions, aliases, claims, and sitelinks of an item.

reduceItem

Reduce property

Simplifies labels, descriptions, aliases, and claims of a property.

reduceProperty

Reduce labels

.labels|reduceLabels

Reduce descriptions

.descriptions|reduceDescriptions

Reduce aliases

.aliases|reduceAliases

Reduce sitelinks

.sitelinks|reduceSitelinks

Reduce lexeme

Simplifies lemmas, forms, and senses of a lexeme entity.

reduceLexeme

Reduce forms

.forms|reduceForms

Reduce senses

.senses|reduceSenses

Reduce claims

Removes unnecessary fields .id, .hash, .type, .property and simplifies values for each claim.

.claims|reduceClaims

Reduce claim

Reduces a single claim value.

.claims.P26[]|reduceClaim

Reduce references

...

Reduce forms

Only lexemes have forms.

.forms|reduceForms

Reduce info

reduceInfo

Removes additional information fields pageid, ns, title, lastrevid, and modified.

To remove selected fields only, see the jq function del.
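
As a sketch of the del approach, the following removes the same five fields from a made-up entity record using plain jq. It mirrors the description of reduceInfo above but is not the module's code.

```shell
# Drop the page metadata fields and keep everything else untouched.
strip_info() {
  jq -c 'del(.pageid, .ns, .title, .lastrevid, .modified)'
}

echo '{"id":"Q42","type":"item","pageid":1,"ns":0,"title":"Q42","lastrevid":1,"modified":"2020-01-01T00:00:00Z"}' | strip_info
```

Listing fewer paths inside del removes only those fields.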

Stream an array of entities

Module function ndjson can be used to process a stream with an array of entities into a list of entities:

bzcat latest-all.json.bz2 | jq -nc --stream 'include "wikidata"; ndjson'
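
How such a conversion can work with jq's streaming parser is sketched below on a tiny made-up array. It uses the fromstream and truncate_stream builtins; whether ndjson is implemented this way is an assumption.

```shell
# --stream emits [path, value] events instead of parsing the whole array;
# truncating one path level and reassembling yields one object per element.
array_to_ndjson() {
  jq -nc --stream 'fromstream(1 | truncate_stream(inputs))'
}

printf '[\n{"id":"Q1"},\n{"id":"Q2"}\n]\n' | array_to_ndjson
```

Because events are handled one at a time, memory use stays bounded by the size of a single element rather than the whole array.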

An alternative, possibly more performant method strips the array brackets and trailing commas with standard Unix tools:

bzcat latest-all.json.bz2 | head -n-1 | tail -n+2 | sed 's/,$//'
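
This relies on the dump placing each entity on its own line between the opening and closing bracket. A small made-up sample shows the effect; note that negative line counts for head -n are a GNU coreutils feature and may not be available elsewhere.

```shell
# Drop the last line ("]"), drop the first line ("["),
# then strip the trailing comma that separates array elements.
strip_array() {
  head -n-1 | tail -n+2 | sed 's/,$//'
}

printf '[\n{"id":"Q1"},\n{"id":"Q2"}\n]\n' | strip_array
```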

Contributing

The source code is hosted at https://github.com/nichtich/jq-wikidata.

Bug reports and feature requests are welcome!

License

Made available under the MIT License by Jakob Voß.
