bible-translations kriging multidimensional-scaling parallel-corpora word-alignment

parallelbibles

Word-alignment models for Bible translations in 100+ historical and contemporary languages

Requirements

Installation and dependencies:
- Download or clone the repository:
  
  $ git clone https://github.com/npedrazzini/parallelbibles
- From the root directory (./parallelbibles), build the repository:
  
  $ make
This will download and build SyMGIZA++ [1] and install all the required dependencies in a virtual environment (venv) called parallels-venv.

XML files, in one of two formats:
- OPUS (untokenized) (from https://opus.nlpl.eu/bible-uedin.php)
- PROIEL (from https://proiel.github.io)
This repository comes with OPUS XMLs (inside original-xmls/opus-xmls) and PROIEL XMLs for New Testament Greek, Old Church Slavonic and Gothic (inside original-xmls/proiel-xmls).
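
As a rough illustration of what the conversion step has to read, the sketch below pulls verse text out of a small OPUS-style XML string using Python's standard library. The element and attribute names (seg, type="verse", ids such as b.GEN.1.1) are assumptions about the OPUS layout, and verses_from_opus is a hypothetical helper, not a function from this repository; check the files in original-xmls/opus-xmls before relying on it.

```python
import xml.etree.ElementTree as ET

# Invented sample; tag and attribute names are assumptions about the OPUS layout.
SAMPLE = """<cesDoc>
  <text><body>
    <div id="b.GEN" type="book">
      <div id="b.GEN.1" type="chapter">
        <seg id="b.GEN.1.1" type="verse">In the beginning God created
          the heaven and the earth.</seg>
        <seg id="b.GEN.1.2" type="verse">And the earth was without form, and void.</seg>
      </div>
    </div>
  </body></text>
</cesDoc>"""


def verses_from_opus(xml_text):
    """Map each verse id to its text, with internal whitespace collapsed."""
    root = ET.fromstring(xml_text)
    verses = {}
    for seg in root.iter("seg"):
        if seg.get("type") == "verse":
            text = "".join(seg.itertext())
            verses[seg.get("id")] = " ".join(text.split())
    return verses


verses = verses_from_opus(SAMPLE)
print(verses["b.GEN.1.1"])
```

Keying verses by id is what makes it possible to line up the same verse across languages, including languages that lack some verses.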

Train word-alignment models

This repository already comes with four pretrained models (see Pretrained models below), so this step is only needed if you want a model with different languages or settings.

$ ./train.sh

This step will:

- convert OPUS/PROIEL XML files to GIZA-readable CSV files
- train a word-alignment model for each target language
- make GIZA's outputs easily readable and queryable

You will be prompted to:

- specify the input XML format (OPUS, PROIEL, or mixed)
- enter the desired source language
- enter the target languages (or use all remaining languages as targets)
- specify whether you want to strip punctuation
- specify whether you want to convert everything to lowercase
- provide a name for your model

NB: the chosen languages must be entered as ISO 639-3 codes. See the official ISO 639-3 code list for all codes, and the table below for the languages included in the models.
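
The punctuation and lowercase options above amount to a simple normalization of each verse before alignment. The following is a minimal sketch of such a step, assuming whitespace tokenization and Unicode punctuation categories; normalize_verse is a hypothetical name and the repository's own preprocessing may differ in detail.

```python
import unicodedata


def normalize_verse(text, lowercase=True, strip_punct=True):
    """Return a verse as space-separated tokens, optionally lowercased
    and with punctuation characters removed."""
    if strip_punct:
        # Drop every character whose Unicode category starts with "P"
        # (covers ASCII punctuation as well as non-Latin punctuation marks).
        text = "".join(
            ch for ch in text if not unicodedata.category(ch).startswith("P")
        )
    if lowercase:
        text = text.lower()
    # Collapse runs of whitespace left behind by removed characters.
    return " ".join(text.split())


verse = "And God said, Let there be light: and there was light."
print(normalize_verse(verse))
print(normalize_verse(verse, lowercase=False))
```

The two flags correspond to the four combinations used by the pretrained models (UC/LC and P/NP).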

Extract words and their translations

$ ./extract.sh

This step will:

- extract every occurrence of a word (or multiple words) in the source language and its translation in the target languages
- (optionally) generate scripts to run multidimensional scaling (MDS) on the dataset and Kriging (to draw lines around clusters probabilistically)

You will be prompted to enter:

- the name of the model you want to use (e.g. 'model2-LC-NP')
- a target word (e.g. 'when') or multiple target words separated by hyphens (e.g. 'when-while-since')
- whether you want to generate the scripts necessary to run MDS on the dataset ('yes' or 'no')
- whether you also want to apply Kriging to the MDS maps ('yes' or 'no')
- whether you only want to extract words from the New Testament ('yes') or from both the Old and the New Testament ('no') ^*

The output will be a folder named as the target word (or words, hyphen-separated, if extracting multiple words at once) containing the following:

- word.csv: a CSV file for each word. The file contains one occurrence per line, with its citation (Bible verse), context, and the translations in each target language ^**.

And if you chose to run MDS (with and without Kriging) it will also contain:

- word-MDS.R: an R script to run MDS (and Kriging, if you chose to), generating a single PDF with one map per language. These maps are static and generated using base R. Best for distant-reading stages in the data exploration ^***.
- word-plotly.R: an R script (alternative to word-MDS.R) generating multiple HTML files using the R package plotly. These maps are interactive and let you hover over the data points and look at the citation (Bible verse) and source word in context. Best for close-reading stages in the data exploration.
- word-data.txt: the original data in TXT format, with the citation (Bible verse) as index (rather than as a column, as in word.csv) and without the 'context' column.
- word-matrix.txt: distance matrix between source word and target words.
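
To give an idea of the kind of distance matrix MDS works on, the sketch below computes a relative Hamming distance between occurrences: the share of languages in which two occurrences of the source word receive different translations. This is one plausible choice only; whether the scripts in this repository compute exactly this measure is an assumption, and the function names and sample rows are invented for illustration.

```python
def relative_hamming(row_a, row_b, missing="NA"):
    """Share of languages in which two occurrences are translated differently.
    Languages where either occurrence is missing (NA) are skipped."""
    compared = 0
    differing = 0
    for a, b in zip(row_a, row_b):
        if a == missing or b == missing:
            continue
        compared += 1
        if a != b:
            differing += 1
    # None signals that the two occurrences share no language at all.
    return differing / compared if compared else None


def distance_matrix(rows):
    """Pairwise distances between all occurrences (rows)."""
    return [[relative_hamming(a, b) for b in rows] for a in rows]


# Hypothetical rows: one per occurrence, one column per target language.
rows = [
    ["wenn", "quand", "NULL"],
    ["als", "quand", "NA"],
    ["wenn", "quand", "NULL"],
]
matrix = distance_matrix(rows)
print(matrix[0][1], matrix[0][2])
```

Occurrences that are translated the same way in most languages end up close together on the resulting map.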

* This is because many languages lack the whole or large sections of the Old Testament, which will result in your dataset having many NAs (which you may or may not want to avoid).

** NB: NULL indicates that the model did not find a match for the word in the target language. NA indicates that the target language did not have a Bible translation of that particular verse in the first place (e.g. some languages lack a translation of the whole Old Testament).
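
Because NULL and NA mean different things, it can help to count them separately per language before interpreting a map. The sketch below does this for an in-memory CSV; the column names (citation, context, deu, got) and the sample rows are invented, so adjust them to the header of your own word.csv.

```python
import csv
import io

# Invented sample in the spirit of word.csv; the real header may differ.
SAMPLE = """citation,context,deu,got
MRK 1.1,when he came,als,NULL
GEN 1.1,when God began,wenn,NA
MRK 2.1,when it was heard,als,þan
"""


def summarize_cells(csv_text, language_columns):
    """Count aligned words, NULLs (no match found) and NAs (verse not
    translated) for each language column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    summary = {
        lang: {"aligned": 0, "NULL": 0, "NA": 0} for lang in language_columns
    }
    for row in reader:
        for lang in language_columns:
            value = row[lang]
            key = value if value in ("NULL", "NA") else "aligned"
            summary[lang][key] += 1
    return summary


summary = summarize_cells(SAMPLE, ["deu", "got"])
print(summary)
```

A language with many NAs simply lacks the verses, whereas many NULLs suggest that it expresses the source word by other means or that the alignment failed.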

*** NB 1: This script is a heavy adaptation of the code by [2].

NB 2: The lmap function relies on the R package qlcVisualize. If you have issues installing it, simply save the two functions we need from that package by running the script ./scripts/postprocessing/lmap-boundary-functions.R included in this repository.

NB 3: The MDS script has been adapted so that it merges all translations with fewer than 10 occurrences with NULLs. The threshold of 10 is arbitrary and was based on what seemed to be a common cut-off point between 'real' translations in the target language and incidental correspondences between the source word and a specific lexical item in the target language.
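
The merging of rare translations with NULLs can be pictured as follows. This is a Python sketch of the idea only (the actual logic lives in the generated R script); merge_rare is a hypothetical name, and leaving NA cells untouched is an assumption.

```python
from collections import Counter


def merge_rare(translations, min_count=10, null_label="NULL", missing="NA"):
    """Relabel translations occurring fewer than min_count times as NULL.
    Existing NULL and NA cells are left as they are (assumption)."""
    counts = Counter(t for t in translations if t not in (null_label, missing))
    return [
        t if t in (null_label, missing) or counts[t] >= min_count else null_label
        for t in translations
    ]


# Hypothetical column for one target language.
column = ["als"] * 12 + ["wenn"] * 3 + ["NA", "NULL"]
merged = merge_rare(column)
print(Counter(merged))
```

Lowering min_count keeps more low-frequency translations on the map, at the cost of more noise from incidental alignments.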

Hierarchical clusters and NeighborNets

./scripts/postprocessing/splitstree.R: this script will perform hierarchical clustering and NeighborNet analysis of the languages based on a criterion x (default: NULL-constructions).

It takes as input the file word-data.txt described above.

The script will:

- Plot a simple hierarchical cluster of the languages in a parallel-word dataset. It currently shows how similar languages appear to be based on NULL-construction distributions.
- Generate a Nexus (.nex) file for NeighborNet analysis, to be visualized with the SplitsTree4 software. Similar to a traditional hierarchical cluster in many ways, a NeighborNet will simply not force a binary-tree type of classification.
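
For orientation, here is a hedged Python sketch of how language-to-language distances based on NULL distributions could be written out as a Nexus distances block. The distance measure (share of shared occurrences where exactly one of two languages has NULL), the function names and the sample data are all assumptions made for illustration; the block layout follows the Nexus distances format as it is commonly written for SplitsTree4 and should be checked against the file that splitstree.R actually produces.

```python
def null_profile_distance(profile_a, profile_b):
    """Share of shared occurrences where exactly one language has NULL."""
    pairs = [(a, b) for a, b in zip(profile_a, profile_b)
             if a != "NA" and b != "NA"]
    if not pairs:
        return 1.0  # no shared verses: treated as maximally distant (assumption)
    mismatches = sum((a == "NULL") != (b == "NULL") for a, b in pairs)
    return mismatches / len(pairs)


def to_nexus(columns):
    """Render a dict {language: list of translations} as Nexus text."""
    langs = sorted(columns)
    lines = [
        "#NEXUS", "",
        "BEGIN taxa;",
        f"  DIMENSIONS ntax={len(langs)};",
        "  TAXLABELS " + " ".join(langs) + ";",
        "END;", "",
        "BEGIN distances;",
        f"  DIMENSIONS ntax={len(langs)};",
        "  FORMAT labels=left diagonal triangle=both;",
        "  MATRIX",
    ]
    for a in langs:
        row = " ".join(
            f"{null_profile_distance(columns[a], columns[b]):.4f}" for b in langs
        )
        lines.append(f"  {a} {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical data: three occurrences of the source word in three languages.
columns = {
    "deu": ["als", "NULL", "wenn"],
    "got": ["NULL", "NULL", "NA"],
    "fra": ["quand", "quand", "NULL"],
}
nexus = to_nexus(columns)
print(nexus)
```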

Pretrained models

NB: model2-LC-NP is stored in this repo using Git LFS. If you wish to use that model, you should have Git LFS installed, else you will only see a pointer file.
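
If you are unsure whether a model file was actually downloaded, you can check whether it is still a Git LFS pointer. Pointer files are small text files that begin with a fixed version line, which the sketch below tests for; is_lfs_pointer is a hypothetical helper and the path you pass to it depends on where the model files sit in your checkout.

```python
# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` starts with the Git LFS pointer header."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_HEADER)) == LFS_HEADER
```

If this returns True for a model file, install Git LFS and run `git lfs pull` from the repository root to fetch the real content.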

Four pretrained models currently come with this repository:

- model1-UC-P: Upper Case and with Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.
- model2-LC-NP: Lower Case and No Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.
- model3-UC-NP: Upper Case and No Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.
- model4-LC-P: Lower Case and with Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.

You can directly extract target words from any of these models by running $ ./extract.sh. You will be prompted to enter the name of the model you want to use.
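
The names of the bundled models encode their preprocessing settings (UC/LC for case, P/NP for punctuation). As a small illustration, the hypothetical helper below decodes that convention; it only applies to names that follow the bundled pattern, since models you train yourself can be given any name.

```python
def parse_model_name(name):
    """Decode a bundled-style model name such as 'model2-LC-NP'."""
    parts = name.split("-")
    if len(parts) != 3 or parts[1] not in ("UC", "LC") or parts[2] not in ("P", "NP"):
        raise ValueError(f"not a bundled-style model name: {name!r}")
    return {
        "lowercase": parts[1] == "LC",    # LC = lower case, UC = upper case
        "punctuation": parts[2] == "P",   # P = with punctuation, NP = without
    }


print(parse_model_name("model2-LC-NP"))
```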

Languages

OT = Old Testament

NT = New Testament

ISO 639-3	Language	Language family	OT	NT	Notes
acu	Achuar-Shiwiar	Jivaroan	N	Y
afr	Afrikaans	Indo-European > Germanic	Y	Y
agr	Awajún	Jivaroan	N	Y
ake	Akawaio	Cariban	N	Y
sqi/alb	Albanian	Indo-European	Y	Y
amh	Amharic	Afro-Asiatic > Semitic	Y	N
amu	Guerrero Amuzgo	Otomanguean	N	Y
ara	Arabic	Afro-Asiatic > Semitic	Y	Y
hye/arm	Armenian	Indo-European	Y	Y
baq	Basque	Isolate	N	Y
bsn	Barasana-Eduria	Tucanoan	N	Y
bul	Bulgarian	Indo-European > Balto-Slavic	Y	Y
cak	Kaqchikel	Mayan	N	Y
ceb	Cebuano	Austronesian > Malayo-Polynesian	Y	Y
cha	Chamorro	Austronesian > Malayo-Polynesian	Y	Y	OT only consists of the Psalms
zho/chi	Chinese	Sino-Tibetan > Sinitic	Y	Y
chq	Quiotepec Chinantec	Otomanguean	N	Y
chr	Cherokee	Iroquoian	N	Y
chu	Church Slavonic	Indo-European > Balto-Slavic	N	Y
cjp	Cabécar	Chibchan	N	Y
cni	Asháninka	Maipurean	N	Y
cop	Coptic	Afro-Asiatic > Egyptian	N	Y
crp	Creoles and pidgins	Creole > French-based	Y	Y	The original XML files have the generic 'crp' code. This is however Haitian Creole (code hat)
cze	Czech	Indo-European > Balto-Slavic	Y	Y
dan	Danish	Indo-European > Germanic	Y	Y
deu	German	Indo-European > Germanic	Y	Y
dik	Southwestern Dinka	Nilo-Saharan > Nilotic	N	Y
dje	Zarma	Nilo-Saharan > Songhai	Y	Y
dop	Lukpa	Niger-Congo > Atlantic-Congo	N	Y
epo	Esperanto	Constructed	Y	Y
est	Estonian	Uralic	Y	Y
ewe	Ewe	Niger-Congo > Atlantic-Congo	N	Y
fin	Finnish	Uralic	Y	Y
fra	French	Indo-European > Italic	Y	Y
gbi	Galela	West Papuan	N	Y
gla	Scottish Gaelic	Indo-European > Celtic	N	Y	The only text included is the Gospel of Mark
glv	Manx	Indo-European > Celtic	Y	Y	The only text from the OT is the Book of Esther
got	Gothic	Indo-European > Germanic	N	Y
grc	Ancient Greek (to 1453)	Indo-European	N	Y
ell/gre	Modern Greek (1453-)	Indo-European	Y	Y
guj	Gujarati	Indo-European > Indo-Iranian	N	Y
heb	Hebrew	Afro-Asiatic > Semitic	Y	N
hin	Hindi	Indo-European > Indo-Iranian	Y	Y
hrv	Croatian	Indo-European > Balto-Slavic	Y	Y
hun	Hungarian	Uralic	Y	Y
ind	Indonesian	Austronesian > Malayo-Polynesian	Y	Y
isl	Icelandic	Indo-European > Germanic	Y	Y
ita	Italian	Indo-European > Italic	Y	Y
jak	Jakun	Austronesian > Malayo-Polynesian	N	Y
jap	Japanese	Japonic	Y	Y
jiv	Shuar	Jivaroan	N	Y
kab	Kabyle-Amazigh	Afro-Asiatic > Berber	N	Y
kbh	Camsá	Isolate	N	Y
kor	Korean	Koreanic	Y	Y
lat	Latin	Indo-European > Italic	Y	Y
lav	Latvian	Indo-European > Balto-Slavic	N	Y
lit	Lithuanian	Indo-European > Balto-Slavic	Y	Y
mal	Malayalam	Dravidian	Y	Y
mam	Mam	Mayan	N	Y
mao	Maori	Austronesian > Malayo-Polynesian	Y	Y
mar	Marathi	Indo-European > Indo-Iranian	Y	Y
mya	Burmese	Sino-Tibetan > Tibeto-Burman	Y	Y
nep	Nepali	Indo-European > Indo-Iranian	Y	Y
nhg	Tetelcingo Nahuatl	Uto-Aztecan	N	Y
nld	Dutch	Indo-European > Germanic	Y	Y
nor	Norwegian	Indo-European > Germanic	Y	Y
ojb	Northwestern Ojibwa	Algic > Algonquian	N	Y
pck	Paite Chin	Sino-Tibetan > Tibeto-Burman	Y	Y
pes	Iranian Persian	Indo-European > Indo-Iranian	Y	Y
plt	Plateau Malagasy	Austronesian > Malayo-Polynesian	Y	Y
pol	Polish	Indo-European > Balto-Slavic	Y	Y
por	Portuguese	Indo-European > Italic	Y	Y
pot	Potawatomi	Algic > Algonquian	N	Y
ppk	Uma	Austronesian > Malayo-Polynesian	N	Y
quc	K'iche'	Mayan	N	Y
quw	Tena Lowland Quichua	Quechuan	N	Y
rom	Romany	Indo-European > Indo-Iranian	N	Y
ron/rum	Romanian	Indo-European > Italic	Y	Y
rus	Russian	Indo-European > Balto-Slavic	Y	Y
shi	Tachelhit	Afro-Asiatic > Berber	N	Y
slk	Slovak	Indo-European > Balto-Slavic	Y	Y
slv	Slovenian	Indo-European > Balto-Slavic	Y	Y
sna	Shona	Niger-Congo > Atlantic-Congo	Y	Y
som	Somali	Afro-Asiatic > Cushitic	Y	Y
spa	Spanish	Indo-European > Italic	Y	Y
srp	Serbian	Indo-European > Balto-Slavic	Y	Y
ssw	Swati	Niger-Congo > Atlantic-Congo	N	Y
swe	Swedish	Indo-European > Germanic	Y	Y
syr	Syriac	Afro-Asiatic > Semitic	N	Y
tel	Telugu	Dravidian	Y	Y
tgl	Tagalog	Austronesian > Malayo-Polynesian	Y	Y
tha	Thai	Kra-Dai > Tai	Y	Y
tmh	Tamashek	Afro-Asiatic > Berber	Y	Y
tur	Turkish	Turkic	Y	Y
ukr	Ukrainian	Indo-European > Balto-Slavic	N	Y
usp	Uspanteco	Mayan	N	Y
wal	Wolaytta	Afro-Asiatic > Omotic	N	Y
wol	Wolof	Niger-Congo > Atlantic-Congo	N	Y
xho	Xhosa	Niger-Congo > Atlantic-Congo	Y	Y
zul	Zulu	Niger-Congo > Atlantic-Congo	N	Y

TODO

- Include the following languages:
  a. In all models: vie, kan, djk, kek, agr, mal
  b. In model4-LC-P only: mar, mya, nep, tel
- Fix the display of some non-Latin characters in the PDF output (notably all Arabic). The characters display normally in RStudio, so the issue must lie with both the base R pdf device and CairoPDF.
- Add info on how NULLs are treated in the models.
- Add info on how many NAs there are per language, based on the best model.

References

[1] Junczys-Dowmunt, Marcin & Arkadiusz Szał. 2012. SyMGiza++: Symmetrized Word Alignment Models for Machine Translation. In Pascal Bouvry, Mieczyslaw A. Klopotek, Franck Leprévost, Malgorzata Marciniak, Agnieszka Mykowiecka & Henryk Rybinski (eds.), Security and Intelligent Information Systems (SIIS) (Lecture Notes in Computer Science 7053), 379-390. Heidelberg-Berlin: Springer.

[2] Wälchli, Bernhard. 2010. Similarity Semantics and Building Probabilistic Semantic Maps from Parallel Texts. Linguistic Discovery 8(1). 331-371. DOI:10.1349/PS1.1537-0852.A.356
