LCFRS Supertag Parser

This project implements the extraction of lexical LCFRS rules from corpora of constituent trees, the training of a lexical rule predictor (supertagging), and parsing with lexical rules. The grammar-related parts (extraction, parsing) are implemented as an extension of disco-dop; the prediction is implemented with the flair framework for neural NLP.

Build

The project was developed and tested with Python 3.8. We strongly recommend running it in a conda (or virtualenv) environment:

conda create -n lcfrs-supertagger python=3.8 gcc_linux-64=9.3 gxx_linux-64=9.3 && conda activate lcfrs-supertagger
# or virtualenv: virtualenv venv && source ./venv/bin/activate

Build and install all dependencies, including our fork of disco-dop:

# skip the following command if you did not use git to obtain these files
git submodule update --init --recursive
pip install cython `cat requirements.txt` && pip install ./disco-dop/

Usage

A call of

python prepare-data.py <data.conf>

constructs a lexical grammar and its supertags, and splits the corpus as specified in <data.conf>. Configuration files for Negra, Tiger and the discontinuous Penn Treebank (contact Kilian Evang to obtain it) are already prepared in ./config/corpus/. However, they expect the corpora to be located in ./corpora/. Depending on the format of the downloaded corpora, you may have to adjust the fields inputfmt (export vs. bracket vs. tiger) and inputenc (iso-8859-1 vs. utf-8) in the configuration. The script writes its output files to the folder specified in the filename field of the configuration file.
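If you are unsure which values to use for inputenc and inputfmt, a quick look at the corpus file usually settles it. The following is a minimal sketch and not part of this repository: the function names guess_inputenc and guess_inputfmt are hypothetical, and the format check is only a heuristic based on the first characters of the file (XML for tiger, an opening bracket for bracket, "#" or "%%" lines for export).

```python
import codecs


def guess_inputenc(path, sample_bytes=1 << 20):
    """Return 'utf-8' if the start of the file is valid UTF-8, else 'iso-8859-1'."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_bytes)
    # The incremental decoder tolerates a multi-byte character that is cut off
    # at the end of the sample, but still rejects invalid byte sequences.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return "iso-8859-1"
    return "utf-8"


def guess_inputfmt(path):
    """Heuristically map the first bytes of a corpus file to an inputfmt value.

    Returns 'tiger', 'bracket', 'export', or None if nothing matches.
    """
    with open(path, "rb") as handle:
        head = handle.read(4096).lstrip()
    if head.startswith(b"<"):
        return "tiger"      # XML document
    if head.startswith(b"("):
        return "bracket"    # bracketed trees
    if head.startswith((b"#", b"%%")):
        return "export"     # e.g. "#FORMAT", "#BOS" or "%%" comment lines
    return None
```

A pure ASCII file is reported as utf-8; in that case either encoding setting reads the same text. Treat the result as a hint and check it against the corpus documentation.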

python training.py <data.conf> <model.conf>

trains a sequence tagger as specified in <model.conf> using the prepared data. Existing configurations are bilstm-model.conf, bert-model.conf, supervised-small.conf and supervised-large.conf in ./config/model/. During training, the script writes checkpoints to a newly created folder named after <data.conf>, <model.conf> and the current date. Training is monitored using tensorboard; in addition, the losses and scores are listed in loss.tsv.
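To inspect loss.tsv without tensorboard, the table can be read with Python's standard csv module. This is a minimal sketch under assumptions: it expects a tab-separated file with a header row, the helper names read_loss_table and best_row are hypothetical, and the column names in the usage note are placeholders. Check the header of your own loss.tsv for the actual names.

```python
import csv


def read_loss_table(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def best_row(rows, column, maximize=True):
    """Return the row with the best numeric value in `column`.

    Rows where the column is missing or not numeric are skipped.
    Returns None if no row has a usable value.
    """
    scored = []
    for row in rows:
        try:
            scored.append((float(row[column]), row))
        except (KeyError, TypeError, ValueError):
            continue
    if not scored:
        return None
    pick = max if maximize else min
    return pick(scored, key=lambda pair: pair[0])[1]
```

For example, best_row(read_loss_table("loss.tsv"), "DEV_LOSS", maximize=False) would return the epoch with the lowest development loss, assuming the file has a column of that name.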

When training finishes, the model is automatically evaluated on the test set and parsing scores are reported. You may repeat this evaluation (e.g. with changed evaluation parameters or changed data) by calling

python evaluate.py <data.conf> <model>

where <model> is a model file from the checkpoint folder (e.g. trained-corpus-model-date/best-model.pt).

Calling

python prepare-data.py config/corpus/example.conf
python training.py config/corpus/example.conf config/model/example.conf

should work out of the box, as it uses a small (publicly available) sample of the Alpino corpus distributed with disco-dop. Comments on the configuration files can be found in config/model/example.conf and config/corpus/example.conf.

Tiger corpus

This treebank needs special treatment, because some of its nodes are linked to multiple parents. The issue is solved by removing the superfluous links (cf. https://github.com/mcoavoux/multilingual_disco_data/blob/master/generate_tiger_data.sh):

sed -e "3097937d;3097954d;3097986d;3376993d;3376994d;3377000d;3377001d;3377002d;3377008d;3377048d;3377055d" tiger.xml > tiger-fixed.xml
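If sed is not available, the same line deletion can be done in Python. The following is a minimal sketch, not part of this repository: the function name drop_lines and the constant name are hypothetical, and the line numbers are copied from the sed command above. The file is processed in binary mode, so the encoding of tiger.xml does not matter.

```python
# 1-based line numbers of the superfluous links, copied from the sed command
TIGER_SUPERFLUOUS_LINES = frozenset({
    3097937, 3097954, 3097986, 3376993, 3376994, 3377000,
    3377001, 3377002, 3377008, 3377048, 3377055,
})


def drop_lines(source, target, line_numbers):
    """Copy `source` to `target`, skipping the given 1-based line numbers.

    Returns the number of lines that were actually dropped.
    """
    dropped = 0
    with open(source, "rb") as src, open(target, "wb") as dst:
        for number, line in enumerate(src, start=1):
            if number in line_numbers:
                dropped += 1
            else:
                dst.write(line)
    return dropped
```

Calling drop_lines("tiger.xml", "tiger-fixed.xml", TIGER_SUPERFLUOUS_LINES) should return 11. A smaller value means the input file has fewer lines than expected, so it is probably not the release of the corpus these line numbers refer to.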
