Heliozoa / jadata

Japanese language files for use by LBR.

jadata

Generates the kanjifile and wordfile files used by lbr.

Both files are derived from "skeleton" files that assign a stable id to each entry. The skeletons are then filled with data from the following files:

kanjifile_skeleton.json:

  • KANJIDIC2 (kanjidic2.xml) from The Electronic Dictionary Research and Development Group. Contains a list of kanji and their meanings.
  • KRADFILE (kradfile) from The Electronic Dictionary Research and Development Group. Contains decompositions of each kanji into common "components".

wordfile_skeleton.json:

  • JMdict (JMdict_e_examp.xml) from The Electronic Dictionary Research and Development Group. Contains a list of words and phrases, their readings and meanings.
  • JmdictFurigana (JmdictFurigana.json) from Doublevil. Contains the readings for each word in JMdict assigned as furigana.

The core idea is that the kanjifile and wordfile can easily be updated in two ways: by pulling in new versions of KANJIDIC2 and JMdict, and by editing the skeletons to add jadata-specific data such as kanji names and lists of similar kanji. This also means the large, complete files do not need to be stored in version control.
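The skeleton idea can be illustrated with a minimal sketch. The type names, fields, and the `fill_skeleton` function below are hypothetical and are not taken from jadata's actual code; the sketch only assumes that a skeleton pairs a stable id with a kanji, and that the dictionary data can be looked up by that kanji.

```rust
use std::collections::HashMap;

/// Hypothetical skeleton entry: a stable id paired with the kanji it identifies.
#[derive(Debug, Clone, PartialEq)]
pub struct SkeletonEntry {
    pub id: u32,
    pub literal: char,
}

/// Hypothetical filled entry, roughly what a generated file might hold.
#[derive(Debug, Clone, PartialEq)]
pub struct FilledEntry {
    pub id: u32,
    pub literal: char,
    pub meanings: Vec<String>,
}

/// Fills each skeleton entry with meanings looked up in the dictionary data.
/// The ids come only from the skeleton, so they stay the same when the
/// dictionary data is replaced with a newer version.
pub fn fill_skeleton(
    skeleton: &[SkeletonEntry],
    dictionary: &HashMap<char, Vec<String>>,
) -> Vec<FilledEntry> {
    skeleton
        .iter()
        .map(|entry| FilledEntry {
            id: entry.id,
            literal: entry.literal,
            // Assumption: an entry missing from the dictionary gets no meanings.
            meanings: dictionary.get(&entry.literal).cloned().unwrap_or_default(),
        })
        .collect()
}
```

Because the output is rebuilt from the small skeleton plus the upstream dictionary files, only the skeleton needs to be tracked in version control.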

Differences from JMdict

jadata's definition of a "word" is a little different from JMdict's. Essentially, jadata prioritises the "written form" of the word in order to make things easier for a Japanese learner, whereas JMdict prioritises the "meaning" of the word as a dictionary would.

For example, in jadata 船 and 舟 are two different words, both meaning ship and both read ふね, whereas in JMdict they are grouped together as two ways of writing the same word. A learner has to learn each written form separately, so jadata treats them as individual words.
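As a rough illustration of this difference, the sketch below splits one dictionary-style entry with several written forms into one word per written form. The `DictEntry` and `Word` types and the `split_by_written_form` function are hypothetical names chosen for this example and do not reflect jadata's real data structures.

```rust
/// Hypothetical JMdict-style entry: several written forms share one reading
/// and one list of meanings.
#[derive(Debug, Clone, PartialEq)]
pub struct DictEntry {
    pub written_forms: Vec<String>,
    pub reading: String,
    pub meanings: Vec<String>,
}

/// Hypothetical jadata-style word: exactly one written form.
#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub written_form: String,
    pub reading: String,
    pub meanings: Vec<String>,
}

/// Produces one `Word` per written form, copying the shared reading and
/// meanings into each of them.
pub fn split_by_written_form(entry: &DictEntry) -> Vec<Word> {
    entry
        .written_forms
        .iter()
        .map(|form| Word {
            written_form: form.clone(),
            reading: entry.reading.clone(),
            meanings: entry.meanings.clone(),
        })
        .collect()
}
```

Applied to an entry with the written forms 船 and 舟, the reading ふね, and the meaning "ship", this yields two separate words that differ only in their written form.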

Crates

jadata_cli

A binary crate that implements functionality for generating and updating the kanjifile.json and wordfile.json files.

jadata

A library crate which contains the Kanjifile and Wordfile data structures and logic for serializing and deserializing them.

Updating the skeletons

See the files in the scripts directory, or use the CLI manually with cargo run. The wordfile is large so updating it may take a moment.

License

jadata's code is licensed under MPL-2.0.

The files in ./included as well as the files created by the program are licensed under CC BY-SA 4.0, matching the license of the source files they are derived from.
