Morphological Tagging and Lemmatization in Context

MSc AI Thesis @ University of Amsterdam

New

UDIFY models added

Evaluation procedure

Added documentation & thesis chapter

To Do

Dump additional information in ./morphological_tagging/README.md

This project has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825299 (GoURMET).

This repo holds a collection of utilities and scripts for building CoNLL/UniMorph corpora, training joint morphological taggers and lemmatizers, evaluating trained models and converting models into performant pipelines.

We replicate the winning systems from the 2019 SIGMORPHON/CONLL Shared Task 2^[1], extended to the latest UD treebanks and the UniMorph tagging schema^[2]. We also contribute a competitive architecture of our own, extend the evaluation method, and show robust performance on many language types for both known and unknown word-forms/lemmas.

All models use a single, consistent PyTorch framework. They are easy to train and test and, once trained, easy to use.

Installation

This repo contains code for both development and inference. The latter requires far fewer external dependencies than the former.

Clone this repo, e.g.

git clone https://github.com/ioverho/morph_tag_lemmatize.git
cd morph_tag_lemmatize

Build the Anaconda environment

conda env create -f env.yaml # For inference only
conda env create -f env_development.yaml # For inference & dev
conda activate morph_tag_lemmatize

Download needed pipeline checkpoints or datasets

Tested on Windows 11, Debian Linux 4.19.0 and Ubuntu 20.04.5 (through WSL2).

Inference

All pre-trained models (including the dictionaries that map raw text to model input and model output to processed text) can be found in the Google Drive folder here. The languages chosen are meant to represent a broad set of morphologically interesting European languages; the selection is by no means complete. Training new models on other languages is detailed below.

Basic Usage

Pipelines wrap pretrained PyTorch models with all the logic needed for tokenization, batch collation and output conversion: in short, sentences go in, lemmas and morphological tags come out. The main aim is out-of-the-box ease of use.

Currently implemented pipelines:

UFAL Prague's UDPipe2^[4, 5, 6]
Dan Kondratyuk's UDIFY^[8, 9]
Our own CANINE^[7]-based DogTag

Both UDIFY and DogTag support multilingual pre-training. This can provide a modest performance boost, at the cost of significantly more expensive training. In the saved model checkpoints, mono indicates training only on the target language (i.e. no pre-training), and multi indicates first training on all languages from the same typological family before fine-tuning on the target language.

When to use what

All models perform roughly the same, overall. UDIFY works best, especially with multilingual pre-training, even on lower resource languages. DogTag is a very strong lemmatizer, but lags somewhat on morphological tagging. It also does not benefit from multilingual pre-training. UDPipe tends to perform worse.

In case memory or speed constraints are in place:

Memory: UDPipe requires loading in both word and contextual (i.e. a BERT variant) embeddings. These dominate memory used. DogTag requires only loading in a smaller transformer, CANINE. For both file and RAM usage, DogTag is significantly slimmer (~1.5 GB).
Inference Speed: CANINE operates at the character level, resulting in far longer input sequences. UDIFY operates at the BPE level, but uses two separate LSTMs for each task. As such, UDPipe has higher throughput. However, all models are reasonably fast on both CPU and GPU.
Training Speed: UDPipe requires finetuning a relatively small number of parameters on top of a lot of pre-trained modules. Training is much faster than other implemented models.

Command-line Interface

The tag_file.py script allows you to quickly tag a text file of sentences into some other format containing lemmas and morphological features. To tag a file, the path and language (entered as natural text) must be provided, e.g. on CPU:

python tag_file.py \
  --file_path {$FILE_TO_TAG} \
  --language {$LANGUAGE} \
  --gpu 0

Additional command line options include:

  --file_path     the location of the text file
  --language      the language of the text
  --pipeline_dir  location of the pretrained pipelines. Defaults to './pipelines'
  --pipeline      pipeline checkpoint name in `pipeline_dir`, must contain architecture
  --gpu           whether to annotate on GPU, if available
  --batch_size    number of lines being fed into the pipeline
  --encoding      encoding of text file
  --output_format {single_pickle_file,separate_text_files,single_jsonlines_file} output format
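
When several files need tagging, the documented flags can be assembled programmatically. The following is a minimal sketch, assuming tag_file.py sits in the working directory and accepts the flags exactly as listed above. The helper name build_tag_command and the example file names are hypothetical.

```python
import sys
from pathlib import Path
from typing import List

# Output formats as listed in the options above.
ALLOWED_FORMATS = {
    "single_pickle_file",
    "separate_text_files",
    "single_jsonlines_file",
}


def build_tag_command(
    file_path: str,
    language: str,
    gpu: int = 0,
    output_format: str = "single_jsonlines_file",
) -> List[str]:
    """Assemble an argument list for tag_file.py from the documented flags."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    return [
        sys.executable, "tag_file.py",
        "--file_path", str(Path(file_path)),
        "--language", language,
        "--gpu", str(gpu),
        "--output_format", output_format,
    ]


if __name__ == "__main__":
    # Hypothetical inputs; the commands are printed rather than executed.
    for path, lang in [("nl.txt", "Dutch"), ("fr.txt", "French")]:
        print(" ".join(build_tag_command(path, lang)))
```

Each returned list can be handed to subprocess.run(cmd, check=True) to actually launch the script.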

Python API

Once created, the pipelines can be saved and loaded without needing to point to a dataset class or a model checkpoint. The checkpoint files contain only a minimal subset of the parameters needed for the pipeline, and are thus smaller than the full model at run time.

from morphological_tagging.pipelines import UDPipe2Pipeline, UDIFYPipeline, DogTagPipeline

# PipelineClass stands for any of the classes imported above, e.g. DogTagPipeline
pipeline = PipelineClass.load(save_loc, map_location=device)

To use, simply feed in a list of strings, or a list of lists of strings if tokenization occurs outside of the tagger.

# If tokenizer is provided
pipeline(List[str])

# If tokenization is performed already
pipeline(List[List[str]], is_pre_tokenized=True)

# If sampling from TreebankDataModule
pipeline(Tuple[Union[List, torch.Tensor]], is_batch_input=True)

The default output is a tuple of lists containing, in order, the predicted lemmas, lemma scripts, morphological tags and categories. To switch to a per-token collection, use the transpose argument in the forward call:

pipeline(List[str], transpose=True) -> List[Tuple[lemma, lemma_script, morph_tags, morph_cats], ...]
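
The relationship between the two output layouts can be illustrated in plain Python. This is only a sketch of the idea for a single sentence with invented values, assuming each of the four lists holds one entry per token. It is not the pipeline's actual implementation, and the lemma script strings and category names are placeholders.

```python
from typing import List, Tuple

# Hypothetical default-style output for one sentence: four parallel lists.
lemmas = ["the", "dog", "bark"]
lemma_scripts = ["script_0", "script_1", "script_2"]  # placeholder labels
morph_tags = [["DEF"], ["N", "PL"], ["V", "PST"]]
morph_cats = [["Definiteness"], ["POS", "Number"], ["POS", "Tense"]]

default_output = (lemmas, lemma_scripts, morph_tags, morph_cats)


def to_per_token(output: Tuple[List, ...]) -> List[Tuple]:
    """Turn a tuple of parallel lists into one tuple per token."""
    return list(zip(*output))


per_token = to_per_token(default_output)
for lemma, script, tags, cats in per_token:
    print(lemma, script, ";".join(tags), cats)
```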

For more details regarding pipeline creation, see here.

Training & Evaluation

Datasets

The files used for the SIGMORPHON/CONLL Shared Task 2019 are in CONLL-U format, except with the features column automatically converted^[3] to UniMorph tagsets^[2]. The original data files can be found here.

Since the competition, the UD treebanks have seen 6 new releases, with improvements and extensions made to many corpora. We have similarly converted UD2.9 to carry UniMorph tags using the ud-compatibility repo. Datasets with existing train/valid/test splits can be found in the Google Drive.

While many more languages are available, we identified 38 which contain high quality annotations and enough samples for successful training. Dataset size and diversity still vary considerably. For more details regarding dataset creation and usage, see here.

Treebanks can be created via the build_treebank_corpus.py file. It looks for all datasets from a language, and merges those that meet certain criteria. Finally, batches are created for fast training. An annotated config file can be found in ./morphological_tagging/config/treebank_corpus.yaml.

Models & Training

An early design choice was to opt for seq-first batching for UDPipe2, but batch-first for the others. Unfortunately, this makes dataset files incompatible between models.
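
To make the incompatibility concrete: the two layouts differ only in which axis comes first. The sketch below uses nested lists as stand-ins for tensors. The padding index and token ids are invented, and the function names are hypothetical rather than part of this repo.

```python
from typing import List

PAD = 0  # hypothetical padding index


def pad_batch_first(seqs: List[List[int]]) -> List[List[int]]:
    """Pad to equal length; the result is indexed as [batch][time]."""
    max_len = max(len(s) for s in seqs)
    return [s + [PAD] * (max_len - len(s)) for s in seqs]


def to_seq_first(batch_first: List[List[int]]) -> List[List[int]]:
    """Swap the two axes; the result is indexed as [time][batch]."""
    return [list(step) for step in zip(*batch_first)]


sentences = [[5, 6, 7], [8, 9]]           # two sentences of token ids
batch_first = pad_batch_first(sentences)  # [[5, 6, 7], [8, 9, 0]]
seq_first = to_seq_first(batch_first)     # [[5, 8], [6, 9], [7, 0]]
print(batch_first)
print(seq_first)
```

With PyTorch tensors the same axis swap is tensor.transpose(0, 1), but any batches already written to disk keep the layout they were built with.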

For details, see here. Generally, for deep learning experiments no reproduction is exact, and this project is no exception. Differences are detailed for each model in their respective section, ordered by expected impact (largest to smallest). Furthermore, test set performance is reported.

All training is conducted through the train_tagger.py script. A config file needs to be supplied to Hydra when calling through CLI. For example,

python train_tagger.py \
  --config-name udify_experiment \
  ++trainer.precision=16 \
  gpu=1 \
  hydra/job_logging=disabled \
  hydra/hydra_logging=disabled

trains a UDIFY model on the default dataset using half-precision on GPU. The last two lines disable Hydra's own logging, which is strongly recommended if using a third-party logger like wandb or tensorboard. Additional configuration options can be found in the respective /config/model and /config/data directories, with some default values taken from /config/default_train.yaml.

Evaluating

To evaluate a trained model (checkpoint in ./morphological_tagging/checkpoints/) on a pre-defined dataset stored in ./morphological_tagging/data/corpora, run the evaluate_tagger.py script. The corresponding configuration file can be found under ./morphological_tagging/config/eval.yaml.

For example, to evaluate a pretrained model MODEL on a dataset of LANGUAGE/TREEBANKNAME combination on GPU:

python evaluate_tagger.py \
  ++model_name={$MODEL} \
  ++dataset_name={$LANGUAGE}_{$TREEBANKNAME} \
  gpu=1 \
  hydra/job_logging=disabled hydra/hydra_logging=disabled

It will automatically search for the most recent version of the model available.

The script outputs a pickle file containing tuples of:

(token, lemma, predicted lemma, lemma script, predicted lemma script, morphological tags, predicted morphological tags, whether token is present in vocab, whether lemma is present in vocab)

which can be used for analysis of model behaviour. For an example, see evaluation.ipynb. Results from this notebook are also used in the corresponding thesis chapter. This repo comes with many eval files already produced; see ./eval/.
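
As a rough illustration of such an analysis, the sketch below computes lemma and tag accuracy separately for tokens that are and are not present in the vocabulary. It assumes the tuple field order given above. The records are invented, the file name in the comment is hypothetical, and storing tags as strings is an assumption about the real files.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Invented records in the documented field order:
# (token, lemma, pred_lemma, script, pred_script, tags, pred_tags,
#  token_in_vocab, lemma_in_vocab)
records: List[Tuple] = [
    ("dogs", "dog", "dog", "s0", "s0", "N;PL", "N;PL", True, True),
    ("ran", "run", "ran", "s1", "s2", "V;PST", "V;PST", True, True),
    ("blorps", "blorp", "blorp", "s0", "s0", "N;PL", "N;SG", False, False),
]
# With a real evaluation file one would instead do something like:
#   with open("some_eval_file.pickle", "rb") as f:  # hypothetical name
#       records = pickle.load(f)


def accuracy_by_vocab(rows: List[Tuple]) -> Dict[bool, Dict[str, float]]:
    """Lemma and tag accuracy, grouped by whether the token is in vocab."""
    counts = defaultdict(lambda: {"n": 0, "lemma": 0, "tags": 0})
    for row in rows:
        _, lemma, pred_lemma, _, _, tags, pred_tags, tok_known, _ = row
        bucket = counts[tok_known]
        bucket["n"] += 1
        bucket["lemma"] += int(lemma == pred_lemma)
        bucket["tags"] += int(tags == pred_tags)
    return {
        known: {"lemma_acc": c["lemma"] / c["n"], "tag_acc": c["tags"] / c["n"]}
        for known, c in counts.items()
    }


print(accuracy_by_vocab(records))
```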

The script also conveniently builds a pipeline object from the checkpoint (saved in ./pipelines). The pipeline contains the aggregated performance stats printed at the end of this script.

References

1: McCarthy, A. D., Vylomova, E., Wu, S., Malaviya, C., Wolf-Sonkin, L., Nicolai, G., Kirov, C., Silfverberg, M., Mielke, S. J., Heinz, J., Cotterell, R. & Hulden, M. (2019). The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. arXiv preprint arXiv:1910.11493.

2: Sylak-Glassman, J. (2016). The composition and use of the universal morphological feature schema (unimorph schema). Johns Hopkins University.

3: McCarthy, A. D., Silfverberg, M., Cotterell, R., Hulden, M., & Yarowsky, D. (2018). Marrying universal dependencies and universal morphology. arXiv preprint arXiv:1810.06743.

4: Straka, M. (2018, October). UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (pp. 197-207).

5: Straka, M., Straková, J., & Hajič, J. (2019). UDPipe at SIGMORPHON 2019: Contextualized embeddings, regularization with morphological categories, corpora merging. arXiv preprint arXiv:1908.06931.

6: Straka, M., & Straková, J. (2020). UDPipe at EvaLatin 2020: Contextualized embeddings and treebank embeddings. arXiv preprint arXiv:2006.03687.

7: Clark, J. H., Garrette, D., Turc, I., & Wieting, J. (2022). Canine: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the Association for Computational Linguistics, 10, 73-91.

8: Kondratyuk, D., & Straka, M. (2019). 75 languages, 1 model: Parsing universal dependencies universally. arXiv preprint arXiv:1904.02099.

9: Kondratyuk, D. (2019, August). Cross-lingual lemmatization and morphology tagging with two-stage multilingual BERT fine-tuning. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology (pp. 12-18).
