envlh / henry

Scripts to import a Breton dictionary by Victor Henry from Wikisource to Wikidata's lexicographical data.

Description

Scripts to import the dictionary Lexique étymologique du breton moderne (Q19216625) by Victor Henry (Q1386172) from Wikisource into Wikidata's lexicographical data. The dictionary is written in French and describes the Breton language.

Dependencies

  • PHP 7
  • Python 3

Installation

Install the dependencies. Example on a Debian-like system:

apt install php python3 python3-pip

Download the project:

git clone "https://github.com/envlh/henry.git"

Install the Python requirements by running the following command at the root of the project:

pip3 install -r requirements.txt

Configuration

The bot uses Pywikibot. One way to log in to Wikidata is to use a bot password.

Download Pywikibot:

git clone "https://gerrit.wikimedia.org/r/pywikibot/core"

After creating your bot password, generate the configuration files from the root of the Pywikibot checkout:

python3 pwb.py generate_user_files.py

Copy the generated files user-config.py and user-password.py to the root of the henry project.
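Before running the bot, it can be handy to confirm that both files ended up in the right place. The following is a minimal sketch, not part of the project: the function name missing_config_files and the REQUIRED_FILES constant are hypothetical, and only the two file names come from the step above.

```python
from pathlib import Path

# File names taken from the configuration step; everything else is illustrative.
REQUIRED_FILES = ("user-config.py", "user-password.py")


def missing_config_files(project_root):
    """Return the Pywikibot files that are not present in project_root."""
    root = Path(project_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the root of the henry project.
    missing = missing_config_files(".")
    if missing:
        print("Missing Pywikibot files:", ", ".join(missing))
    else:
        print("Both Pywikibot configuration files are present.")
```

The check only looks at file presence; it does not validate the credentials themselves.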

Usage

Crawler

Retrieves content from Wikisource, aggregates all pages into one file, and does some cleaning.

php -f crawler.php

Two files are generated:

  • wikitext.txt: raw wikitext crawled from Wikisource (useful for debugging)
  • stripped.txt: wikitext after cleaning
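To illustrate what retrieving raw wikitext from Wikisource can involve, here is a rough Python sketch that builds a MediaWiki Action API request for one page. It is not a description of what crawler.php actually does: the French Wikisource endpoint, the function name build_wikitext_url, and the page title in the usage line are all assumptions made for the example.

```python
from urllib.parse import urlencode

# Assumption: the transcription is hosted on French Wikisource.
API_ENDPOINT = "https://fr.wikisource.org/w/api.php"


def build_wikitext_url(title):
    """Build an Action API URL asking for the latest wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",   # return the revision content (wikitext)
        "rvslots": "main",     # main content slot of the revision
        "titles": title,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


# Hypothetical page title, used only to show the call.
print(build_wikitext_url("Page:Example.djvu/1"))
```

Fetching the URL and concatenating the returned wikitext for every page would give a file comparable in spirit to wikitext.txt; the cleaning rules that produce stripped.txt are specific to the project and are not reproduced here.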

Parser

Parses the file created by the crawler and converts it into a machine-readable format.

python3 parser.py

Several files are generated:

  • lexemes.json: lexemes that will be imported into Wikidata, serialized in Wikibase JSON format
  • lexemes.txt: more human-readable list of the lexemes that will be imported
  • errors.json: rejected lexemes, with the reason for each rejection
  • monograms.json and bigrams.json: frequencies of letters in lemmas
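The letter statistics in the last item can be pictured with a short sketch. This is an illustration under assumptions, not the code in parser.py: the function name letter_frequencies is hypothetical, lemmas are lowercased before counting, and bigrams are counted only within a lemma, never across two lemmas.

```python
from collections import Counter


def letter_frequencies(lemmas):
    """Count single letters and adjacent letter pairs across a list of lemmas."""
    monograms = Counter()
    bigrams = Counter()
    for lemma in lemmas:
        word = lemma.lower()  # assumption: counting is case-insensitive
        monograms.update(word)
        # Pair each character with its successor inside the same lemma.
        bigrams.update(a + b for a, b in zip(word, word[1:]))
    return monograms, bigrams


# Illustrative input; real lemmas would come from the parsed dictionary.
monograms, bigrams = letter_frequencies(["ti", "tad"])
print(monograms.most_common())
print(bigrams.most_common())
```

Frequencies like these are a cheap way to spot transcription problems, since an unexpected character or an implausible letter pair stands out in the counts.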

Import

Imports the data into Wikidata's lexicographical data.

python3 bot.py
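For orientation, a lexeme in Wikibase JSON format has roughly the shape sketched below. This is a minimal illustration, not the exact structure written to lexemes.json or sent by bot.py: the helper build_lexeme is hypothetical, and the lemma and item identifiers in the usage line are example values chosen for the sketch (Q12107 for Breton, Q1084 for noun).

```python
import json


def build_lexeme(lemma, lang_code, language_qid, category_qid):
    """Return a minimal lexeme dictionary in the Wikibase JSON layout."""
    return {
        "type": "lexeme",
        # Lemmas are keyed by language code.
        "lemmas": {lang_code: {"language": lang_code, "value": lemma}},
        "language": language_qid,          # item for the lexeme's language
        "lexicalCategory": category_qid,   # item for the part of speech
    }


# Example values only; a real entry would also carry senses, forms, or claims.
example = build_lexeme("ti", "br", "Q12107", "Q1084")
print(json.dumps(example, ensure_ascii=False, indent=2))
```

Inspecting a few entries of lexemes.json against this shape before running the import is a simple sanity check.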

Copyright

This project, mainly by Envel Le Hir (@envlh) for the code and Nicolas Vigneron (@belett) for the Wikisource transcription, is released under the CC0 license (public domain dedication).
