DeSDA

Welcome to the working repository of my PhD research on the automatic detection of syntactic difference (DeSDA). All tools that I developed, as well as datasets that I compiled, for the purposes of my PhD research have been uploaded here, along with relevant output. My dissertation is yet to be published.

This repository consists of three main folders, corresponding to the central chapters of my dissertation.

Chapter 2 - Filter

The folder Chapter 2 - Filter contains the tools and data described in Chapter 2 of my dissertation and in Kroon, Barbiers, Odijk and van der Pas (2019).

The folder contains two subfolders:

Data

The Data/ folder contains two relevant file types:

*.raw files (e.g. de-en.raw) each contain 400 sentence pairs from Koehn's (2005) Europarl corpus, in the two languages indicated by the language abbreviations in the file name. The two sentences of a pair are separated by a tab. All words have been POS tagged (word|POS). The POS tags were taken directly from the Europarl corpus metadata and translated into Universal Dependencies (Nivre et al. 2016) using the files in Data/tagset_translations/.
*.train files (e.g. de-en.train) contain the 400 sentence pairs from the *.raw files with a label (Y|N) for whether the sentence pair is syntactically comparable or not.
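As an illustration of the *.raw format described above, here is a minimal sketch of how one sentence pair could be read. It assumes one pair per line and tokens separated by single spaces; neither is stated explicitly above. The function names and the example line are hypothetical and are not taken from the repository.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def parse_sentence(sentence: str) -> List[Token]:
    """Split a space-separated 'word|POS' sentence into (word, POS) tuples."""
    tokens = []
    for item in sentence.split():
        # rpartition splits on the LAST '|', so a word that itself
        # contains '|' keeps its full form.
        word, _, pos = item.rpartition("|")
        tokens.append((word, pos))
    return tokens


def parse_raw_line(line: str) -> Tuple[List[Token], List[Token]]:
    """Parse one line holding two tab-separated sentences (assumed layout)."""
    left, right = line.rstrip("\n").split("\t")
    return parse_sentence(left), parse_sentence(right)


# Made-up example line, not taken from the dataset.
example = "Das|DET ist|AUX gut|ADJ\tThat|PRON is|AUX good|ADJ"
pair = parse_raw_line(example)
```

The *.train files additionally carry a Y|N label. Where that label sits on the line is not described here, so the sketch covers *.raw lines only.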

For convenience, the folder furthermore contains UDPipe_models/, which holds models for UDPipe (Straka and Straková 2017), a dependency parser for UD.

Tools

The Tools/ folder contains all the relevant code to run the filters as described in Chapter 2 of my PhD dissertation and in Kroon, Barbiers, Odijk and van der Pas (2019).

AUC_evaluator.py is used to automatically find the best parameter settings of each individual filter based on the *.train files (see above). The variables (which data to use and which UDPipe models to use) are changed within the file. For every parameter setup (described in Chapter 2), the code reports the AUC and the best threshold setting according to both Youden's J statistic (Youden 1950) and the Euclidean distance. The script makes use of some multiprocessing, and relies on levenshtein.py, senlen_ratio.py and networkx_GED.py. Unfortunately, the output is too large to be uploaded.
AUC_evaluator.logreg.py is used to automatically find the best parameter settings of the logistic regression filter based on the *.train files (see above). The variables (which data to use and which UDPipe models to use) are also changed within the file. This script reports on the AUC, but not the best threshold setting (which is always 50%; the AUC is calculated to be able to compare the results). The script also makes use of some multiprocessing, and relies on levenshtein.py, senlen_ratio.py and networkx_GED.py. Unfortunately, the output is too large to be uploaded.
As mentioned, levenshtein.py, senlen_ratio.py and networkx_GED.py are necessary to automatically find the best parameter setup for the filters using the two scripts described above.
levenshtein_filter.py, senlen_ratio_filter.py and networkx_GED_filter.py, on the other hand, take manually set parameters (changed in the file), and take the *.raw files (see above) as input, outputting the dataset with syntactically incomparable sentence pairs filtered out.
logreg_filter.py is the logistic regression filter. It allows for the parameters of the filters it uses to be set manually (changed in the file), and uses *.train files (see above) to train a classifier, and to filter out syntactically incomparable sentence pairs from it.
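To make the threshold selection concrete, the following is a minimal sketch of how the AUC, Youden's J statistic and the Euclidean distance to the ideal ROC corner can be computed from filter scores and Y|N labels. It is not the code in AUC_evaluator.py. All function names and the toy data are hypothetical, and the sketch assumes that a higher score means "more likely comparable" (for a distance-based filter the comparison would be reversed).

```python
import math
from typing import List, Tuple

RocPoint = Tuple[float, float, float]  # (threshold, TPR, FPR)


def roc_points(scores: List[float], labels: List[bool]) -> List[RocPoint]:
    """Compute TPR and FPR for every distinct score used as a threshold.

    A pair counts as predicted positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((threshold, tp / positives, fp / negatives))
    return points


def auc(scores: List[float], labels: List[bool]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_by_youden(points: List[RocPoint]) -> float:
    """Threshold maximising Youden's J = TPR - FPR."""
    return max(points, key=lambda p: p[1] - p[2])[0]


def best_by_euclidean(points: List[RocPoint]) -> float:
    """Threshold whose ROC point is closest to the ideal corner (FPR=0, TPR=1)."""
    return min(points, key=lambda p: math.hypot(p[2], 1 - p[1]))[0]


# Toy data: True stands for label Y (comparable), False for label N.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, True, False, True, False]
points = roc_points(scores, labels)
```

The two criteria agree on this toy data, but they need not in general: Youden's J weighs sensitivity and specificity equally along the diagonal, while the Euclidean criterion prefers points near the top-left corner of the ROC plot.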

Chapter 3 - MDL

The folder Chapter 3 - MDL contains the tools and data described in Chapter 3 of my dissertation and in Kroon, Barbiers, Odijk and van der Pas (2020). The README describes clearly how to recreate the research.

In the folder one can find, among other things, MDL_difference_detector.py, the main tool to detect syntactic differences using MDL. Variables are set within the Python file. The relevant variables are:

setup, which sets how the script should be run: with or without filtered data (the first character), with or without superpattern subtraction (the second character). setup must be (NN|NY|YN|YY).
lang_a and lang_b, which correspond to the language abbreviations used in the Data/ folder.
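As a small illustration of the setup convention above, the two characters can be decoded into two booleans. This is a sketch only: the function name and the returned variable names are hypothetical and do not come from MDL_difference_detector.py.

```python
from typing import Tuple

VALID_SETUPS = {"NN", "NY", "YN", "YY"}


def parse_setup(setup: str) -> Tuple[bool, bool]:
    """Decode a two-character setup string into two flags.

    First character: run on filtered data (Y) or not (N).
    Second character: apply superpattern subtraction (Y) or not (N).
    """
    if setup not in VALID_SETUPS:
        raise ValueError(f"setup must be one of {sorted(VALID_SETUPS)}, got {setup!r}")
    use_filtered_data = setup[0] == "Y"
    subtract_superpatterns = setup[1] == "Y"
    return use_filtered_data, subtract_superpatterns


# Example: filtered data, no superpattern subtraction.
flags = parse_setup("YN")
```

Validating the string up front means a typo such as "Yn" fails immediately instead of silently running the wrong configuration.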

MDL_difference_detector.py takes specifically formatted input. Please refer to the README to recreate the research.

The output of MDL_difference_detector.py can be found in Output/.

Chapter 4 - Alignment

The folder Chapter 4 - Alignment contains the tools and data described in Chapter 4 of my dissertation.

The folder contains three subfolders. For more information, please refer to the README.

Data

The Data/ folder contains:

en-hu: contains data files relevant to word alignment with eflomal (Östling and Tiedemann 2016), such as input and output;
- en-hu.eflomal.txt: sentence pairs formatted for input;
- the rest are output files;
python: contains an English and Hungarian Bible (from Christodoulopoulos and Steedman 2015), with one verse ID and verse per line, but only those verses that are present in both versions of the Bible;
- xml_aligner.py: can be used to align the XML Bibles from Christodoulopoulos and Steedman (2015) such that the output contains only the verses present in both translations.
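The idea of keeping only the verses present in both translations can be sketched as an intersection over verse IDs. This is an assumption-laden illustration, not the logic of xml_aligner.py: it skips XML parsing altogether, starts from two hypothetical dictionaries that map verse IDs to verse text, and uses made-up IDs and placeholder text.

```python
from typing import Dict, List, Tuple


def align_verses(
    bible_a: Dict[str, str], bible_b: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (verse_id, text_a, text_b) for verse IDs found in both Bibles.

    The order of bible_a is preserved; verses missing from either side are dropped.
    """
    return [
        (verse_id, text_a, bible_b[verse_id])
        for verse_id, text_a in bible_a.items()
        if verse_id in bible_b
    ]


# Placeholder data; the IDs and texts are illustrative only.
english = {"GEN.1.1": "english verse 1", "GEN.1.2": "english verse 2", "GEN.1.3": "english verse 3"}
hungarian = {"GEN.1.1": "hungarian verse 1", "GEN.1.3": "hungarian verse 3", "GEN.1.4": "hungarian verse 4"}
aligned = align_verses(english, hungarian)
```

Each resulting triple can then be written out with one verse ID and verse per line, matching the description of the aligned Bible files above.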

Tools

The Tools/ folder contains the three main tools developed for Chapter 4 (AAA, DGAE and GTI), which can be used to detect syntactic differences.

Output

The Output/ folder contains all the relevant output:

AAA_en-hu.txt: the output of AAA;
DGAE_en-hu_deprel.txt: the output of DGAE grouping over deprel;
DGAE_en-hu_pos.txt: the output of DGAE grouping over pos;
DGAE_en-hu_pos_deprel.txt: the output of DGAE grouping over pos and deprel;
GTI_en-hu_deprel.fragment.txt: a fragment (first 50,000 lines) of the output of GTI pre-splitting over deprel;
GTI_en-hu_pos.fragment.txt: a fragment (first 50,000 lines) of the output of GTI pre-splitting over pos;
GTI_en-hu_pos_deprel.fragment.txt: a fragment (first 50,000 lines) of the output of GTI pre-splitting over pos and deprel.
