direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy

spaCy Project: Parsing the Jingdian Shiwen

This project is an attempt to convert the annotations compiled by the Tang dynasty scholar Lu Deming (陸德明) in the Jingdian Shiwen (经典释文) into a structured form that separates phonology, glosses, and references to secondary sources. A spaCy pipeline is configured to parse and tag the annotations, and Prodigy is used for guided annotation of the training data. The project is part of a broader effort to build a linguistic model of Old Chinese (上古漢語) that incorporates phonology.
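
As a rough illustration of what "structured form" could mean here, the sketch below separates phonological material from the rest of an annotation using two regular expressions: one for fanqie spellings (two characters followed by 反) and one for direct sound glosses (音 followed by one character). This is not the project's pipeline, which is a trained spaCy model. The `ParsedAnnotation` class, the function name, and the sample strings are all hypothetical, and the sample strings are written in the style of the annotations rather than quoted from the edition used here.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Fanqie spelling: an "initial" speller and a "final" speller followed by 反.
FANQIE = re.compile(r"(?P<initial>.)(?P<final>.)反")
# Direct sound gloss: 音 followed by a single homophone character.
DIRECT = re.compile(r"音(?P<reading>.)")


@dataclass
class ParsedAnnotation:
    """Hypothetical container; the project's real output schema may differ."""
    fanqie: List[Tuple[str, str]] = field(default_factory=list)
    direct: List[str] = field(default_factory=list)
    remainder: str = ""


def parse_annotation(text: str) -> ParsedAnnotation:
    """Pull out phonological material and keep everything else as remainder."""
    parsed = ParsedAnnotation()
    parsed.fanqie = [(m["initial"], m["final"]) for m in FANQIE.finditer(text)]
    rest = FANQIE.sub("", text)
    parsed.direct = [m["reading"] for m in DIRECT.finditer(rest)]
    parsed.remainder = DIRECT.sub("", rest)
    return parsed


print(parse_annotation("呼報反下同"))
print(parse_annotation("音悅注同"))
```

A purely pattern-based approach like this breaks down quickly on real annotations (for example when 音 or 反 appears as an ordinary word), which is one reason to train a statistical span model instead.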

Data

The Jingdian Shiwen comprises Lu's annotations on most of the "Thirteen Classics" (十三經) of the Confucian tradition, as well as some Daoist texts. We use the edition of the Jingdian Shiwen found in the Collectanea of the Four Categories (四部叢刊), which includes high-quality lithographic reproductions of many ancient texts. The annotations given in the Jingdian Shiwen are paired with the source texts to which they apply; for this we predominantly use the definitive (正文) editions published by the Kanseki Repository.

| Work | Title | Source | Jingdian Shiwen chapters (卷) |
| --- | --- | --- | --- |
| 周易 | Book of Changes | KR1a0001 | 2 |
| 尚書 | Book of Documents | KR1b0001 | 3-4 |
| 毛詩 | Mao Commentary on the Book of Odes | KR1c0001 | 5-7 |
| 周禮 | Rites of Zhou | KR1d0001 | 8-9 |
| 儀禮 | Etiquette and Ceremonial | CH1e0873* | 10 |
| 禮記 | Book of Rites | KR1d0052 | 11-14 |
| 春秋左傳 | Commentary of Zuo on the Spring and Autumn Annals | KR1e0001 | 15-20 |
| 春秋公羊傳 | Commentary of Gongyang on the Spring and Autumn Annals | CH1e0877* | 21 |
| 春秋穀梁傳 | Commentary of Guliang on the Spring and Autumn Annals | KR1e0008 | 22 |
| 孝經 | Classic of Filial Piety | KR1f0001 | 23 |
| 論語 | Analects of Confucius | KR1h0004 | 24 |
| 老子 | Laozi | KR5c0057 | 25 |
| 莊子 | Zhuangzi | KR5c0126 | 26-28 |

*This data is sourced with permission from the China Ancient Texts (CHANT) database.

We omit chapter 1 of the Jingdian Shiwen, which contains Lu's preface (序錄), and chapters 29-30, which correspond to the Erya (爾雅). All digital sources have been preprocessed to remove punctuation, whitespace, and non-Chinese characters. Kanseki Repository data is generously licensed CC-BY.
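
The preprocessing step can be approximated in a few lines of Python. This is a minimal sketch, not the project's own code: it assumes "Chinese characters" means the CJK Unified Ideographs block plus its extensions, and the `strip_non_chinese` name is hypothetical. The project's actual character ranges may differ (for instance, compatibility ideographs are not kept here).

```python
import re

# Assumption: keep CJK Unified Ideographs (U+4E00-U+9FFF), Extension A
# (U+3400-U+4DBF), and the supplementary-plane extensions (U+20000-U+3134F).
# Everything else, including punctuation and whitespace, is removed.
NON_CHINESE = re.compile("[^\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0003134f]")


def strip_non_chinese(text: str) -> str:
    """Drop punctuation, whitespace, and any other non-Chinese characters."""
    return NON_CHINESE.sub("", text)


print(strip_non_chinese("子曰：「學而時習之，不亦說乎？」 (Analects 1.1)"))
```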

After processing, the labeled output data is saved in JSON-lines (.jsonl) format, to be used for machine learning, natural language processing, and other computational applications.
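
For readers unfamiliar with the format, a JSON-lines file holds one JSON object per line, so it can be streamed record by record. The sketch below shows one way to read such a file in Python. The record fields (`text`, `spans`) and the `PHONOLOGY` label are assumptions made for the demo; the project's actual output schema is not described here.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON-lines file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo with a hypothetical record; field names and label are assumptions.
sample = [
    {"text": "呼報反", "spans": [{"start": 0, "end": 3, "label": "PHONOLOGY"}]},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    lines = (json.dumps(record, ensure_ascii=False) for record in sample)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    records = list(read_jsonl(path))

print(records[0]["text"])
```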

Annotating

To annotate training data, you need to have spaCy installed in your Python environment:

pip install spacy

You also need a copy of Prodigy. Once you have the appropriate wheel for your Python version and platform, install it with:

# example: prodigy version 1.11.8 for python 3.10 on windows
pip install prodigy-1.11.8-cp310-cp310-win_amd64.whl

Then, verify the project assets are downloaded:

spacy project assets

Install python dependencies needed for annotation:

spacy project run install

Then choose a task (see "Commands" below) and invoke it by name, e.g.:

# annotate spans by correcting heuristic predictions
spacy project run annotate-spans

project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| install | Install dependencies |
| annotate-spans | Annotate spans by correcting predictions based on heuristics |
| export | Export training data from prodigy's database for use with spaCy |
| train | Train a spaCy pipeline |

Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/docs.csv | Local | Table mapping each chapter in a source text to its location in the Jingdian Shiwen |
| assets/variants.json | Local | Equivalency table for graphic variants of characters |
| assets/treebank | Git | Universal Dependencies treebank for Classical Chinese |
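
To show how an equivalency table for graphic variants might be applied, here is a minimal sketch. It assumes the table is a flat JSON object mapping each variant character to a single canonical form; the real structure of assets/variants.json may be different, and the function names and sample entries below are hypothetical.

```python
import json
from pathlib import Path


def load_variant_table(path: Path) -> dict:
    """Load a {variant: canonical} JSON object as a str.translate table.

    Assumption: every key is a single character. The real file layout
    may differ from this.
    """
    mapping = json.loads(path.read_text(encoding="utf-8"))
    return str.maketrans(mapping)


def normalize(text: str, table: dict) -> str:
    """Replace each variant character with its canonical form."""
    return text.translate(table)


# Hypothetical entries, not taken from the project's table.
sample_table = str.maketrans({"爲": "為", "为": "為"})
print(normalize("以爲然", sample_table))
```

Normalizing both the annotations and the source texts with the same table makes it easier to align a headword in the Jingdian Shiwen with its occurrence in the source text when the two editions use different graphic forms.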

Parameters

| Parameter | Description |
| --- | --- |
| embedding | Choose an embedding layer implementation (spaCy's Tok2Vec or Transformer) |
| suggester | Choose between two span suggester architectures (SpanFinder, Ngram) |
| tranformer_model_name | Choose a transformer model from HuggingFace (if using Transformer as the embedding layer) |
| gpu_id | Choose whether you want to use your GPU (device number) or CPU (-1) |

License: MIT License

