ud_folder

Introduction

This is a folder for scripts related to UD Parsing.

At the moment, the scripts turn gold CoNLL-U files into UDPipe-predicted output. This is useful for evaluating parsing systems that do not take gold CoNLL-U data as input, e.g. in the CoNLL UD Parsing shared tasks, where systems are expected to parse raw text or UDPipe-predicted output.

To keep evaluating a parsing system after the official TIRA runs, once the VM has been re-allocated to TIRA, it is useful to have a folder containing the test data as it appears on TIRA (e.g. as predicted by the baseline UDPipe). This makes it possible to carry out further experiments, perform sanity checks, etc.

Requirements

In order to use these scripts you will need to get UDPipe:

git clone https://github.com/ufal/udpipe
cd udpipe/src
make

Then copy the udpipe/src/udpipe binary executable to a directory on your $PATH, or link it into one (e.g. /usr/bin/).
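
If you do not have write access to a system directory, one option is a personal bin folder. The sketch below is only an illustration: the function name install_binary and the default destination $HOME/.local/bin are assumptions, not part of this repository.

```shell
# Copy a built binary into a bin directory and make sure that directory is on PATH.
# install_binary and the default destination are hypothetical, chosen for illustration.
install_binary() {
    local src="$1" dest_dir="${2:-$HOME/.local/bin}"
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/" || return 1
    chmod +x "$dest_dir/$(basename "$src")"
    # Prepend the directory to PATH only if it is not already listed.
    case ":$PATH:" in
        *":$dest_dir:"*) ;;
        *) export PATH="$dest_dir:$PATH" ;;
    esac
}
```

For example, install_binary udpipe/src/udpipe would copy the freshly built binary. The PATH change only lasts for the current shell session; add the export line to your shell startup file to make it permanent.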

Baseline models

You can download the baseline UDPipe models (trained on UD 2.2) that were used for the shared task here: http://universaldependencies.org/conll18/baseline.html

Instructions

Using UDPipe to tokenize, tag and parse the data:

To get the UD 2.2 treebanks, run get_data.sh
To create a folder for each treebank and convert the CoNLL-U test file back to text, run create_test_text_file.sh
To use UDPipe to predict on the test.txt files and generate UDPipe-predicted CoNLL-U files, run udpipe_test.sh
To finish the remaining models, which include the PUD cases, run low_resource.sh
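
As a rough illustration of what the two middle steps amount to, the sketch below recovers raw text from the "# text = ..." comment lines of a CoNLL-U file and then runs UDPipe's tokenizer, tagger and parser on it. The function names and file paths are assumptions made for this example; the scripts in this folder may do the conversion differently (reading the comment lines, for instance, does not preserve paragraph boundaries).

```shell
# Print the raw sentences of a CoNLL-U file, one per line,
# using its "# text = ..." sentence comments.
conllu_to_text() {
    sed -n 's/^# text = //p' "$1"
}

# Tokenize, tag and parse raw text with UDPipe, writing CoNLL-U output.
# Arguments (all hypothetical paths): model file, input text, output file.
udpipe_predict() {
    local model="$1" text="$2" out="$3"
    if ! command -v udpipe >/dev/null 2>&1; then
        echo "udpipe not found on PATH" >&2
        return 1
    fi
    udpipe --tokenize --tag --parse "$model" "$text" > "$out"
}
```

A possible invocation would be conllu_to_text gold.conllu > test.txt followed by udpipe_predict model.udpipe test.txt pred.conllu, where all three file names are placeholders.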

Afterwards, we want to check the accuracy of the test files to make sure they are as close as possible to the ones used in the shared task.

To get the accuracy on the primary / mixed model cases, run get_accuracy.sh
To get the accuracy on the PUD + cases, run accuracy_low_resource.sh
To display the parse results of all the files, run display_results.sh
Finally, to remove new treebanks which were released in v2.2 but were not part of the 2018 shared task, run remove_new_tbs.sh.
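
For reference, an accuracy check of this kind can be sketched as a loop that calls the official CoNLL 2018 evaluation script (conll18_ud_eval.py, which takes a gold file and a system file) on each treebank folder. The folder layout, the file names gold.conllu and pred.conllu, and the EVAL_SCRIPT and PYTHON variables are assumptions for this example and may not match what get_accuracy.sh does.

```shell
# Assumed layout: <root>/<treebank>/gold.conllu and <root>/<treebank>/pred.conllu.
# EVAL_SCRIPT should point at your copy of the official evaluation script.
EVAL_SCRIPT="${EVAL_SCRIPT:-conll18_ud_eval.py}"
PYTHON="${PYTHON:-python3}"

evaluate_all() {
    local root="$1" dir gold pred
    for dir in "$root"/*/; do
        gold="${dir}gold.conllu"
        pred="${dir}pred.conllu"
        # Skip folders that do not contain both files.
        [ -f "$gold" ] && [ -f "$pred" ] || continue
        echo "== $(basename "$dir")"
        # -v asks the script for the full table of metrics.
        "$PYTHON" "$EVAL_SCRIPT" -v "$gold" "$pred"
    done
}
```

Calling evaluate_all with the directory that holds the treebank folders prints one block of scores per treebank, which can then be compared against the published baseline results.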

Work in progress

Things which still need to be done:

Ensure we are using the same parameters that UDPipe used for the 2018 shared task.
Ensure we are using the same models that UDPipe used for the 2018 shared task, e.g. for the following cases:

Czech PUD ← Czech PDT
English PUD ← English EWT
Finnish PUD ← Finnish TDT
Japanese Modern ← Japanese GSD
Swedish PUD ← Swedish Talbanken
Mixed model for all other cases with no training data.
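
The mapping above could be encoded as a small lookup so that the low-resource step always picks the intended model. This is only a sketch: the function name model_source_for is hypothetical, and the lowercase treebank codes follow UD naming conventions rather than anything confirmed by the scripts in this folder.

```shell
# Map a treebank that has no training data of its own to the treebank
# whose model should parse it. Only call this for such treebanks.
model_source_for() {
    case "$1" in
        cs_pud)    echo "cs_pdt" ;;
        en_pud)    echo "en_ewt" ;;
        fi_pud)    echo "fi_tdt" ;;
        ja_modern) echo "ja_gsd" ;;
        sv_pud)    echo "sv_talbanken" ;;
        *)         echo "mixed" ;;   # all other cases with no training data
    esac
}
```

For example, model_source_for sv_pud prints sv_talbanken, and any code not listed falls back to the mixed model.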
