
Word2vec for Akkadian

The present repository provides notebooks and files for drawing lemmatized Akkadian words out of ORACC projects. The purpose of this is to test word2vec on Akkadian vocabulary.

The Corpus

The texts that are used come from different periods and represent different text types. The text categories may be described by ORACC project:

Royal Inscriptions

  • RINAP: Neo-Assyrian royal inscriptions (ca. 900-612 BCE)
  • RIAO: Earlier Assyrian royal inscriptions (ca. 1900-1000 BCE)
  • RIBO: Babylonian royal inscriptions (ca. 1100-64 BCE)
  • SUHU: Royal inscriptions of two early first-millennium BCE rulers of the kingdom of Sūḫu

The RINAP and RIBO projects are further subdivided according to king (RINAP) or dynasty (RIBO).

Technical, scholarly, religious, and literary texts

  • GLASS: Middle Babylonian (ca. 1200 BCE) and Neo-Assyrian (ca. 625 BCE) technical recipes for the production of glass and perfumes.
  • CAMS/GKAB: Neo-Assyrian (ca. 800-612 BCE) and Late Babylonian (ca. 300 BCE) scholarly libraries from Nimrud, Huzirina, and Uruk.
  • CCPO: Neo-Assyrian (ca. 650 BCE) and Late Babylonian (ca. 300 BCE) scholarly commentaries on traditional texts.
  • CMAWRO: Old Babylonian (ca. 1800 BCE) and first millennium anti-witchcraft texts.
  • CAMS/ANZU: Reconstruction of the first millennium version of the three tablets (chapters) of the Anzu story.
  • BLMS: First millennium bilingual (Sumerian - Akkadian) texts from Babylonia and Assyria. Includes cultic songs, literary texts, and incantations.

Several of these projects include both Sumerian and Akkadian (often in the same document). For the purpose of this project only the Akkadian is included in the output.

Letters and Administrative Texts

  • AEMW/AMARNA: International correspondence of the Egyptian Pharaoh found in Amarna, ca. 1300 BCE.
  • HBTIN: Hellenistic-period contracts from Uruk.
  • SAAO: Letters and administrative texts from the Assyrian royal court (ca. 800-612 BCE).

The SAAO project is subdivided into 19 sub-projects according to the volumes in the SAA series. Each volume includes letters or documents relating to a particular king, or is devoted to a particular subject (such as divination). In the current output all SAAO documents are found in a single file.

A few projects have not been included here, for a variety of reasons. DCCMT and RIMANUM are in need of updating. The lexical texts in DCCLT do not provide a regular textual context and may therefore be less valuable for word2vec (one could argue, similarly, that the commentary texts in CCPO should not be included). It might be worth adding DCCLT and DCCLT/NINEVEH to see how this changes the outcomes of the analysis.

In collecting the data from ORACC only lemmatized texts have been taken into account. For an introduction to ORACC lemmatization see the documentation page.

The corpus currently contains more than 6,000 texts of varying length.

Data and Data Format

The .csv files in the /output directory have been produced by the get_json.ipynb notebook in this repository. It parses the .json files that are produced by ORACC projects. For more information about the various .json files on ORACC see the Oracc Open Data page.

Each .csv file in /output has two fields: id_text and lemma. The field id_text contains a text ID that consists of a letter (P, Q, or X) and a six-digit number. The ID can easily be expanded into a URL that points at the online edition of the text, by combining it with the file name. The code for doing so is:

url = 'http://oracc.org/' + filename[:-4].replace('_', '/') + '/' + id_text

Thus the URL for id_text = Q006239 in the file ribo_babylon2.csv is http://oracc.org/ribo/babylon2/Q006239. Similarly, one may use the same data to reconstruct the name of the relevant .json file: http://oracc.org/ribo/babylon2/corpusjson/Q006239.json.
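
As a minimal sketch (assuming pandas and the /output layout described above; the file name and text ID are simply the examples from the previous paragraph), the same reconstruction can be done for a whole file at once:

import pandas as pd

filename = 'ribo_babylon2.csv'                      # example file from /output
df = pd.read_csv('output/' + filename)              # columns: id_text, lemma

project = filename[:-4].replace('_', '/')           # 'ribo/babylon2'
df['url'] = 'http://oracc.org/' + project + '/' + df['id_text']
df['json_url'] = 'http://oracc.org/' + project + '/corpusjson/' + df['id_text'] + '.json'

print(df.loc[df['id_text'] == 'Q006239', ['url', 'json_url']])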

The field lemma consists of a concatenation of all the lemmas in that text, in their original order. The format of a lemma is:

CitationForm[GuideWord]PartofSpeech (abbreviated as CF[GW]POS).

An example is:

immeru[sheep]N
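
Should the three parts be needed separately, a lemma in this format can be split with a small regular expression; the pattern below is only a sketch of my own, not something provided by the repository:

import re

lemma = 'immeru[sheep]N'
match = re.match(r'(?P<cf>.+)\[(?P<gw>.*)\](?P<pos>.+)$', lemma)
if match:
    cf, gw, pos = match.group('cf', 'gw', 'pos')    # 'immeru', 'sheep', 'N'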

The GW, technically, is not a translation but a disambiguator for homonyms. The complete lemma is a pointer to an entry in a standard dictionary (A Concise Dictionary of Akkadian by J. Black, A. George, and N. Postgate). The GW does not take into account contextual meaning. Thus the word muhhu[skull]N is most commonly used in the prepositional expression ina muhhi, lemmatized ina[in]PRP muhhu[skull]N, which as a whole means 'upon'.

In the present data representation lemmas never include spaces or commas (commas in the original ORACC lemmatizations have been replaced by hyphens, and spaces have been removed). The regular expression for identifying tokens is therefore very simple:

[^ ]+

That is: any sequence of characters that does not include a space.
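
In Python, for example, tokenization then reduces to a single call (splitting on whitespace would do the same thing):

import re

line = 'ina[in]PRP muhhu[skull]N immeru[sheep]N'    # hypothetical lemma field
tokens = re.findall(r'[^ ]+', line)
# ['ina[in]PRP', 'muhhu[skull]N', 'immeru[sheep]N']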

If a word cannot be lemmatized (because it is unknown or broken) it is represented in its (sign-by-sign) transliteration, followed by NA as GW and POS, for example:

x-ka-ti[NA]NA

These non-lemmas (which are meaningless for all practical purposes) are included in the data because they may separate words that would otherwise seem to be adjacent.
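
If one wants to know how much of a text is unlemmatized, the NA/NA convention makes these tokens easy to spot; this check is my own suggestion, not part of the notebook:

tokens = ['x-ka-ti[NA]NA', 'immeru[sheep]N']        # hypothetical token list
unlemmatized = [t for t in tokens if t.endswith('[NA]NA')]
print(f'{len(unlemmatized)} of {len(tokens)} tokens are unlemmatized')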

Many of these non-lemmas are introduced by a dollar sign ($). The $ indicates that the reading of the signs in question is uncertain (many cuneiform signs are polyvalent). For the present purposes this is irrelevant, and I recommend removing all $ signs (there are no other places where $ is used).
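
Removing them is a one-line operation on the lemma field, for instance (the sample string is made up):

line = '$x-ka-ti[NA]NA immeru[sheep]N'              # hypothetical lemma field
line = line.replace('$', '')                        # drop all dollar signs
# 'x-ka-ti[NA]NA immeru[sheep]N'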

In theory, lemmatization follows strict rules that ensure that the same word is lemmatized the same way across projects. In practice, this has often gone wrong, so that the same word may be represented differently in the corpus. Examples are:

būru[(bull)calf]N vs. būru[(bull)-calf]N

ṭēmu[(fore)thought]N vs. ṭēmu[instruction]N

Occasionally there may also be different versions of the CF. For further exploration it would therefore be advantageous to create a cleaning pipeline with OpenRefine or a similar tool. For now, the data may be explored as they are.
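
A very small stopgap, short of a full OpenRefine pipeline, would be to map known variant spellings to a single arbitrarily chosen form before training; the dictionary below contains only the pairs mentioned above and would have to be extended by hand:

variants = {
    'būru[(bull)-calf]N': 'būru[(bull)calf]N',
    'ṭēmu[(fore)thought]N': 'ṭēmu[instruction]N',
}
tokens = ['ṭēmu[(fore)thought]N', 'immeru[sheep]N']  # hypothetical token list
tokens = [variants.get(t, t) for t in tokens]
# ['ṭēmu[instruction]N', 'immeru[sheep]N']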
