spaCy Project: Parsing the Jingdian Shiwen

This project is an attempt to convert the annotations compiled by the Tang dynasty scholar Lu Deming (陸德明) in the Jingdian Shiwen (經典釋文) into a structured form that separates phonology, glosses, and references to secondary sources. A spaCy pipeline is configured to parse and tag the annotations, and Prodigy is used for guided annotation of the training data. The project is part of a broader effort to build a linguistic model of Old Chinese (上古漢語) that incorporates phonology.

Data

The Jingdian Shiwen comprises Lu's annotations on most of the "Thirteen Classics" (十三經) of the Confucian tradition, as well as some Daoist texts. We use the edition of the Jingdian Shiwen found in the Collectanea of the Four Categories (四部叢刊), which includes high-quality lithographic reproductions of many ancient texts. The annotations given in the Jingdian Shiwen are paired with the source texts to which they apply; for this we predominantly use the definitive (正文) editions published by the Kanseki Repository.

work	title	source	Jingdian Shiwen chapters (卷)
周易	Book of Changes	KR1a0001	2
尚書	Book of Documents	KR1b0001	3-4
毛詩	Mao Commentary on the Book of Odes	KR1c0001	5-7
周禮	Rites of Zhou	KR1d0001	8-9
儀禮	Etiquette and Ceremonial	CH1e0873*	10
禮記	Book of Rites	KR1d0052	11-14
春秋左傳	Commentary of Zuo on the Spring and Autumn Annals	KR1e0001	15-20
春秋公羊傳	Commentary of Gongyang on the Spring and Autumn Annals	CH1e0877*	21
春秋穀梁傳	Commentary of Guliang on the Spring and Autumn Annals	KR1e0008	22
孝經	Classic of Filial Piety	KR1f0001	23
論語	Analects of Confucius	KR1h0004	24
老子	Laozi	KR5c0057	25
莊子	Zhuangzi	KR5c0126	26-28

*This data is sourced with permission from the China Ancient Texts (CHANT) database.

We omit chapter 1 of the Jingdian Shiwen, the preface (序錄), and chapters 29-30, corresponding to the Erya (爾雅). All digital sources have been preprocessed to remove punctuation, whitespace, and non-Chinese characters. Kanseki Repository data is generously licensed CC-BY.
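The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name `strip_non_chinese` is hypothetical, and the choice of Unicode ranges that count as "Chinese characters" is an assumption.

```python
import re

# Assumption: "Chinese characters" means the CJK ideograph blocks listed here;
# the project's real character set may be wider or narrower.
NON_CJK = re.compile(
    "[^"
    "\u3400-\u4dbf"          # CJK Extension A
    "\u4e00-\u9fff"          # CJK Unified Ideographs
    "\uf900-\ufaff"          # CJK Compatibility Ideographs
    "\U00020000-\U0003134f"  # Extensions B through G and the compatibility supplement
    "]"
)


def strip_non_chinese(text: str) -> str:
    """Drop punctuation, whitespace, and anything outside the ranges above."""
    return NON_CJK.sub("", text)


print(strip_non_chinese("子曰：「學而時習之，不亦說乎？」 (1.1)"))
# prints 子曰學而時習之不亦說乎
```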

After processing, the labeled output data is saved in JSON-lines (.jsonl) format, to be used for machine learning, natural language processing, and other computational applications.
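For readers unfamiliar with JSON-lines, the sketch below shows one way to write and read such a file from Python. The record layout (a `text` field plus a list of `spans` with `start`, `end`, and `label`) and the `PHONOLOGY` label are assumptions made for illustration; the project's actual output schema may differ.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line, keeping Chinese characters unescaped."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON-lines file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical record: the field names and the label are assumptions.
records = [
    {"text": "說音悅", "spans": [{"start": 1, "end": 3, "label": "PHONOLOGY"}]},
]

# Round-trip through a temporary file to show that nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.jsonl"
    write_jsonl(path, records)
    loaded = read_jsonl(path)

print(loaded == records)
```

Passing `ensure_ascii=False` keeps the characters human-readable in the file instead of escaping them as `\uXXXX` sequences.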

Annotating

To annotate training data, you need to have spaCy installed in your Python environment:

pip install spacy

You also need a copy of Prodigy. Once you have the appropriate wheel, install it with:

# example: prodigy version 1.11.8 for python 3.10 on windows
pip install prodigy-1.11.8-cp310-cp310-win_amd64.whl

Then, verify the project assets are downloaded:

spacy project assets

Install the Python dependencies needed for annotation:

spacy project run install

Then, choose a task (see "commands" below). Invoke it with e.g.:

# annotate spans by correcting heuristic predictions
spacy project run annotate-spans

project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

Command	Description
`install`	Install dependencies
`annotate-spans`	Annotate spans by correcting predictions based on heuristics
`export`	Export training data from prodigy's database for use with spaCy
`train`	Train a spaCy pipeline
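To give a sense of what heuristic span predictions could look like, the sketch below marks two common phonological formulas in annotation text: fanqie spellings of the form "XY反" and direct homophone glosses of the form "音X". It is an illustration only and is not taken from the project's code; the function `suggest_spans`, the labels `FANQIE` and `DIRECT`, and the output layout are all hypothetical.

```python
import re

# Hypothetical labels and patterns, chosen for illustration only.
PATTERNS = {
    # Two spelling characters followed by 反, e.g. 烏路反
    "FANQIE": re.compile(r"..反"),
    # 音 followed by a single homophone character, e.g. 音悅
    "DIRECT": re.compile(r"音."),
}


def suggest_spans(text):
    """Return character-offset spans for every pattern match, sorted by start."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(
                {"start": match.start(), "end": match.end(), "label": label}
            )
    return sorted(spans, key=lambda span: span["start"])


print(suggest_spans("惡烏路反"))
# prints [{'start': 1, 'end': 4, 'label': 'FANQIE'}]
print(suggest_spans("說音悅"))
# prints [{'start': 1, 'end': 3, 'label': 'DIRECT'}]
```

Rules this simple will produce false positives, which is why predictions of this kind are corrected by hand during annotation rather than trusted directly.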

Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

File	Source	Description
`assets/docs.csv`	Local	Table mapping each chapter in a source text to its location in the Jingdian Shiwen
`assets/variants.json`	Local	Equivalency table for graphic variants of characters
`assets/treebank`	Git	Universal Dependencies treebank for Classical Chinese
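An equivalency table for graphic variants is typically used to normalize text before annotations are aligned with their source passages. The sketch below assumes a flat mapping from variant character to canonical character. The real layout of `assets/variants.json` is not documented here, so the mapping shape, the sample entries, and the function name `normalize_variants` are all assumptions.

```python
# Assumed shape: {variant_character: canonical_character}. In practice this
# would be loaded from a JSON file with json.load(); a small inline sample
# is used here so that the example is self-contained.
SAMPLE_VARIANTS = {
    "説": "說",
    "爲": "為",
}


def normalize_variants(text, variants):
    """Replace each variant character with its canonical equivalent."""
    # str.maketrans accepts a dict whose keys are single characters.
    return text.translate(str.maketrans(variants))


print(normalize_variants("不亦説乎", SAMPLE_VARIANTS))
# prints 不亦說乎
```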

Parameters

Parameter	Description
`embedding`	Choose an embedding layer implementation (spaCy's Tok2Vec or Transformer)
`suggester`	Choose between two span suggester architectures (SpanFinder, Ngram)
`tranformer_model_name`	Choose a transformer model from HuggingFace (if using Transformer as the embedding layer)
`gpu_id`	Choose whether you want to use your GPU (device number) or CPU (-1)
