erayyildiz / Morpheus

Contextual Lemmatization and Morphological Tagging in 100 different languages. A Participant System for SigMorphon2019 Task 2

Morpheus: A Neural Network for Jointly Learning Contextual Lemmatization and Morphological Tagging

Contextual Lemmatization and Morphological Tagging in 108 different languages. A Participant System for SigMorphon2019 Task 2

Introduction

Morpheus is a joint contextual lemmatizer and morphological tagger based on a neural sequential architecture. Its inputs are the characters of the surface words in a sentence; its outputs are the minimum edit operations that transform each surface word into its lemma, together with the morphological tags assigned to each word.

Morpheus does not rely on any language-specific settings, so it can be run on any language without additional configuration. According to the results of SigMorphon 2019 Task 2, Morpheus performs comparably to current state-of-the-art systems for both lemmatization and morphological tagging in nearly 100 languages. It placed 3rd in lemmatization and 9th in morphological tagging among all participating teams.

The experiments show that predicting edit actions instead of the characters of the lemmata is notably better, not only for lemmatization but also for tagging. The improvements are especially significant for low-resource languages.
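To make the idea of edit actions concrete, here is a minimal sketch of how a surface word can be turned into a sequence of per-character operations that yields its lemma. It assumes a plain Levenshtein alignment and hypothetical label names (copy, delete, replace_X, insert_X); the operation inventory that Morpheus actually uses is defined in the paper and the repository code, not by this example.

```python
from typing import List


def edit_operations(surface: str, lemma: str) -> List[str]:
    """Return one minimum-cost edit script that turns `surface` into `lemma`.

    Label names are hypothetical and chosen for illustration only.
    """
    n, m = len(surface), len(lemma)
    # dist[i][j] = Levenshtein distance between surface[:i] and lemma[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if surface[i - 1] == lemma[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a surface character
                dist[i][j - 1] + 1,         # insert a lemma character
                dist[i - 1][j - 1] + cost,  # copy or replace
            )

    # Walk back from the bottom-right corner to recover the operations.
    ops: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and surface[i - 1] == lemma[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            ops.append("copy")
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append("replace_" + lemma[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert_" + lemma[j - 1])
            j -= 1
    ops.reverse()
    return ops


def apply_operations(surface: str, ops: List[str]) -> str:
    """Rebuild the lemma by applying `ops` to `surface` from left to right."""
    out: List[str] = []
    i = 0  # position in the surface word
    for op in ops:
        if op == "copy":
            out.append(surface[i])
            i += 1
        elif op == "delete":
            i += 1
        elif op.startswith("replace_"):
            out.append(op[len("replace_"):])
            i += 1
        else:  # insert_X emits a character without consuming the surface
            out.append(op[len("insert_"):])
    return "".join(out)


ops = edit_operations("walked", "walk")
print(ops)  # ['copy', 'copy', 'copy', 'copy', 'delete', 'delete']
print(apply_operations("walked", ops))  # walk
```

In a scheme like this, most labels for a regular inflection are copy, so the classifier chooses from a small, largely language-independent label set rather than from the full character vocabulary.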

The architecture of Morpheus

[Figure: Morpheus architecture]

Reference

Please cite the following work if you use the tool.

Eray Yildiz and A. Cuneyd Tantug. 2019. Morpheus: A Neural Network for Jointly Learning Contextual Lemmatization and Morphological Tagging. In Proceedings of the 16th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Florence, Italy. Association for Computational Linguistics.

The paper can be found in the ACL Anthology: https://www.aclweb.org/anthology/W19-4205

Have a look at the SIGMORPHON 2019 Proceedings for all methods proposed in the workshop.

Datasets

The data originates from the Universal Dependencies project and has been converted to the UniMorph schema.

Sentences are annotated in the ten-column CoNLL-U format.

  • The ID column gives each word a unique ID within the sentence.
  • The FORM column gives the word as it appears in the sentence.
  • The LEMMA column contains the form’s lemma.
  • The FEATS column contains morphosyntactic features in the UniMorph schema.
  • The six remaining columns are nulled out and replaced with an underscore (‘_’).

In the test data used at prediction time, the LEMMA and FEATS columns are nulled out as well.
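The column layout described above can be read with a few lines of Python. The following is a minimal sketch rather than the loader used by Morpheus: the `Token` type and `read_conllu` function are hypothetical names, and the sample rows are illustrative and not taken from the shared-task data. It relies on the standard CoNLL-U column order, in which ID, FORM, LEMMA and FEATS are the first, second, third and sixth columns.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    id: str
    form: str
    lemma: str
    feats: str


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from ten-column CoNLL-U lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment line
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 tab-separated columns, got %d" % len(cols))
        sentence.append(Token(cols[0], cols[1], cols[2], cols[5]))
    if sentence:
        yield sentence


def _row(idx: int, form: str, lemma: str, feats: str) -> str:
    # Columns other than ID, FORM, LEMMA and FEATS are nulled out with "_".
    return "\t".join([str(idx), form, lemma, "_", "_", feats, "_", "_", "_", "_"])


# Illustrative sample only; tags follow the UniMorph style of ";"-joined features.
SAMPLE = [
    "# illustrative sentence, not taken from the shared-task data",
    _row(1, "dogs", "dog", "N;PL"),
    _row(2, "bark", "bark", "V;PRS"),
    "",
    _row(1, "cats", "cat", "N;PL"),
    _row(2, "sleep", "sleep", "V;PRS"),
]

for sent in read_conllu(SAMPLE):
    print([(tok.form, tok.lemma, tok.feats) for tok in sent])
```

For test files, the same reader returns "_" in the `lemma` and `feats` fields, since those columns are nulled out.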

See the UniMorph dataset collection, which includes datasets for more than 100 languages.

Usage

Requirements

You can use any computer with Python 3 installed. We strongly recommend using a machine with a GPU if you want to train models. To install the dependencies, install the packages listed in requirements.txt as follows:

pip install -r requirements.txt

Training

To train a joint contextual lemmatizer and morphological tagger for a language, run the following script from your command line.

train.py -l Turkish-IMST -t ../data/2019/task2/UD_Turkish-IMST/tr_imst-um-train.conllu -d ../data/2019/task2/UD_Turkish-IMST/tr_imst-um-dev.conllu -m my_model

This command trains a model for Turkish.

The options of the script are as follows:

Options:
  -h, --help            show this help message and exit
  --all, --all          use if you want to train models for all languages in
                        UniMorph dataset
  -l LANGUAGE_NAME, --language_name=LANGUAGE_NAME
                        The name of the language
  -t TRAIN_FILE, --train_file=TRAIN_FILE
                        CONLL file path for training
  -d DEV_FILE, --dev_file=DEV_FILE
                        CONLL file path for validation
  -m MODEL_NAME, --model_name=MODEL_NAME
                        Name for the model

After placing the UniMorph datasets into the data directory, you can run the following command to train models for all languages in the directory.

train.py --all -m my_model

After training has been completed, the following files are created in the same directory as the data:

  • {{language_name}}.{{model_name}}.dataset
  • {{language_name}}.{{model_name}}.encoder.model
  • {{language_name}}.{{model_name}}.decoder_lemma.model
  • {{language_name}}.{{model_name}}.decoder_morph.model

All of the created files will be used for prediction.
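Because the four file names follow a fixed pattern, they can be derived from the language name and model name. The sketch below shows one way to do that and to assemble the matching predict.py arguments. The helper names `model_artifacts` and `predict_command` are hypothetical; only the file-name pattern and the -i, -o, -d, -e, -l and -m flags come from this README. How the script is launched (for example through the Python interpreter) depends on your setup.

```python
from pathlib import Path
from typing import Dict, List

# Suffixes taken from the list of files created after training.
SUFFIXES = {
    "dataset": "dataset",
    "encoder": "encoder.model",
    "lemma_decoder": "decoder_lemma.model",
    "morph_decoder": "decoder_morph.model",
}


def model_artifacts(data_dir: str, language_name: str, model_name: str) -> Dict[str, Path]:
    """Map each artifact kind to the path it is expected to have after training."""
    base = Path(data_dir)
    return {
        kind: base / "{}.{}.{}".format(language_name, model_name, suffix)
        for kind, suffix in SUFFIXES.items()
    }


def predict_command(artifacts: Dict[str, Path], input_file: str, output_file: str) -> List[str]:
    """Build the predict.py argument list for one trained model."""
    return [
        "predict.py",
        "-i", input_file,
        "-o", output_file,
        "-d", str(artifacts["dataset"]),
        "-e", str(artifacts["encoder"]),
        "-l", str(artifacts["lemma_decoder"]),
        "-m", str(artifacts["morph_decoder"]),
    ]


artifacts = model_artifacts("../data/2019/task2/UD_Turkish-IMST", "Turkish-IMST", "my_model")
for kind, path in artifacts.items():
    print(kind, path.name)
print(" ".join(predict_command(artifacts, "input.conllu", "output.conllu")))
```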

Prediction

To run a model on a dataset, you can use the predict.py script as follows:

predict.py -i input_conll_file -o output_conll_file -d dataset_obj_path -e encoder_model_file -l lemma_decoder_model_file -m morph_decoder_model_file

Options:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file=INPUT_FILE
                        Input CONLL file path
  -o OUTPUT_FILE, --output_file=OUTPUT_FILE
                        Output file path
  -d DATASET_OBJ_FILE, --dataset_obj_file=DATASET_OBJ_FILE
                        The path of the dataset object which is saved during
                        training process
  -e ENCODER_FILE, --encoder_file=ENCODER_FILE
                        The path of the encoder object which is saved during
                        training process
  -l LEMMA_DECODER_FILE, --lemma_decoder_file=LEMMA_DECODER_FILE
                        The path of the lemma decoder object which is saved
                        during training process
  -m MORPH_DECODER_FILE, --morph_decoder_file=MORPH_DECODER_FILE
                        The path of the morph decoder object which is saved
                        during training process

Note that the files DATASET_OBJ_FILE, ENCODER_FILE, LEMMA_DECODER_FILE and MORPH_DECODER_FILE are created during training. The input file must be in CoNLL format: tab-separated, with the surface words in the second column. The other columns are not read and are ignored. The output file is also in CoNLL format, with lemmata in the third column and morphological tags in the sixth column.
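As a sketch of the input and output conventions just described, the helpers below write tokenized sentences as tab-separated rows and pull the lemma and tag columns out of a prediction file. The function names are hypothetical, and two details are assumptions rather than documented behaviour: that rows have ten columns with a 1-based token ID in the first one, and that sentences are separated by a blank line.

```python
from typing import Iterable, List, Tuple


def to_conll_lines(sentences: Iterable[List[str]]) -> List[str]:
    """Turn tokenized sentences into tab-separated rows with the word in column 2."""
    lines: List[str] = []
    for words in sentences:
        for idx, word in enumerate(words, start=1):
            cols = ["_"] * 10      # assumption: ten columns, unused ones set to "_"
            cols[0] = str(idx)     # assumption: 1-based token ID
            cols[1] = word         # second column holds the surface word
            lines.append("\t".join(cols))
        lines.append("")           # assumption: blank line between sentences
    return lines


def read_predictions(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (surface word, lemma, tags) triples from output rows."""
    triples: List[Tuple[str, str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Third column: lemma. Sixth column: morphological tags.
        triples.append((cols[1], cols[2], cols[5]))
    return triples


rows = to_conll_lines([["Merhaba", "dünya"]])
print(rows[0].split("\t"))
```

Writing `"\n".join(rows)` to a file would give an input for the `-i` option, and `read_predictions` can then be applied to the lines of the file named by `-o`.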

Experimental Results

Method                                             Lemmatization Accuracy (%)  Morphological Tagging F1 Score (%)
Turku NLP (Kanerva et al., 2018)                   92.18                       86.7
UPPSALA Uni. (Moor, 2018)                          58.5                        88.32
SigMorphon 2019 Baseline (Malaviya et al., 2019)   93.95                       68.72
Morpheus (Character Prediction)                    88.03                       88.94
Morpheus (Edit Operation Prediction)               94.15                       90.52

Language  Dataset Size  Lemmatization (Character Prediction Model)  Lemmatization (Edit Prediction Model)  Morphological Tagging (Character Prediction Model)  Morphological Tagging (Edit Prediction Model)
North-Sami-Giella 29K 87.53 91.90 88.89 92.83
French-GSD 360K 97.06 98.47 97.58 97.99
Japanese-Modern 14K 85.39 93.88 93.06 92.44
Swedish-PUD 18K 82.79 93.05 89.23 92.09
Arabic-PADT 256K 94.39 95.18 95.01 95.40
Basque-BDT 119K 95.42 96.49 93.06 94.47
Urdu-UDTB 123K 95.20 96.02 90.79 91.20
Irish-IDT 21K 85.07 89.23 80.60 71.52
Bambara-CRB 14K 88.24 88.85 93.47 93.56
Dutch-Alpino 200K 94.97 97.81 95.63 96.45
Czech-FicTree 175K 97.39 98.49 94.15 96.39
Danish-DDT 94K 93.16 97.26 94.17 95.62
Latin-ITTB 332K 98.65 98.75 96.84 97.34
French-Sequoia 64K 95.54 98.17 95.96 96.82
Italian-PoSTWITA 115K 92.71 96.61 94.43 95.62
Polish-SZ 93K 93.59 96.86 90.23 93.26
Czech-CLTT 32K 92.11 98.03 89.03 93.82
Cantonese-HK 7K 90.05 94.17 85.41 86.14
Galician-TreeGal 23K 89.68 95.19 89.78 90.72
Slovenian-SSJ 131K 95.25 96.94 93.47 95.79
French-ParTUT 25K 92.67 96.10 93.09 94.55
Lithuanian-HSE 5K 70.60 83.05 43.37 70.70
French-Spoken 35K 94.47 98.48 95.46 96.66
Russian-Taiga 22K 83.59 90.57 76.62 83.80
Latvian-LVTB 150K 94.29 96.22 93.51 95.87
Czech-PDT 1515K 84.86 98.41 87.65 95.27
Japanese-GSD 168K 95.21 98.91 95.35 95.61
Indonesian-GSD 111K 97.06 99.49 92.69 93.11
Gothic-PROIEL 62K 96.60 96.58 93.04 95.12
Latin-PROIEL 219K 96.31 96.37 93.75 95.05
Czech-PUD 19K 83.55 93.57 81.30 86.70
Dutch-LassySmall 96K 93.44 97.58 94.51 95.47
Romanian-RRT 198K 96.54 97.88 96.81 97.44
Korean-Kaist 346K 93.31 95.07 95.70 95.46
Amharic-ATT 11K 93.80 100.00 91.02 91.39
English-GUM 79K 95.58 97.85 93.92 95.48
Estonian-EDT 421K 93.10 96.27 95.64 96.70
Chinese-GSD 111K 95.22 99.15 89.25 90.78
Korean-GSD 80K 87.55 92.89 93.43 94.16
Marathi-UFAL 4K 74.59 76.94 68.26 69.26
Akkadian 2K 42.22 60.89 78.13 66.52
Faroese-OFT 13K 83.56 89.97 88.08 89.49
English-EWT 246K 96.78 98.11 95.61 95.95
Sanskrit-UFAL 3K 53.61 65.98 52.59 55.36
Turkish-IMST 60K 94.13 96.43 91.67 93.72
English-PUD 20K 89.40 95.22 88.88 89.89
Korean-PUD 18K 87.19 98.86 91.42 92.75
Finnish-PUD 16K 77.72 87.55 85.49 92.05
Russian-SynTagRus 1036K 95.31 97.76 94.99 95.13
Croatian-SET 179K 94.91 96.01 94.31 95.47
Tagalog-TRG 406 48.00 84.00 74.23 71.74
Slovenian-SST 31K 91.83 94.97 85.34 89.23
Finnish-FTB 172K 90.70 94.46 94.13 95.85
Polish-LFG 174K 93.85 96.09 92.93 95.35
Portuguese-Bosque 218K 96.43 97.86 96.07 96.59
Coptic-Scriptorium 20K 93.47 95.68 95.17 94.76
Chinese-CFL 7K 82.55 92.66 81.51 83.76
Spanish-AnCora 497K 98.32 98.92 98.29 98.46
Greek-GDT 57K 93.73 96.65 94.71 96.12
Serbian-SET 78K 94.82 97.06 94.36 96.06
Naija-NSC 14K 95.80 99.84 91.15 92.02
Vietnamese-VTB 42K 98.17 99.95 89.45 89.71
Yoruba-YTB 2K 83.60 97.20 80.49 70.67
Italian-PUD 22K 89.51 96.11 92.63 94.22
Finnish-TDT 198K 91.37 95.38 95.67 96.76
English-ParTUT 44K 94.87 97.85 92.32 93.46
Upper-Sorbian-U. 11K 77.79 90.74 69.47 77.46
Norwegian-Ny. 14K 93.89 97.42 90.45 92.20
Galician-CTG 121K 97.18 98.69 97.29 97.30
Old-Church-Slv. 66K 96.48 95.66 93.33 94.91
Russian-GSD 92K 92.90 91.51 91.98 93.91
Kurmanji-MG 10K 85.66 92.69 85.99 85.22
Norwegian-Bk. 299K 96.65 98.94 96.75 97.41
Italian-ISDT 273K 96.90 97.90 97.34 97.89
Komi-Zyrian-IKDP 1K 38.55 68.67 45.50 36.89
Hebrew-HTB 144K 96.49 97.35 95.35 95.70
Tamil-TTB 10K 86.77 96.10 83.07 88.50
Buryat-BDT 10K 83.33 89.61 78.24 82.30
Breton-KEB 12K 85.61 92.81 88.65 90.12
Latin-Perseus 29K 87.30 86.26 79.21 82.88
Romanian-Nonstd 189K 96.10 96.37 95.52 96.36
Italian-ParTUT 50K 94.65 97.44 94.83 96.30
Catalan-AnCora 481K 98.17 98.92 98.42 98.65
Arabic-PUD 22K 81.31 80.90 87.23 88.00
Komi-Zyrian-L. 2K 52.75 77.47 55.79 57.02
Japanese-PUD 25K 86.30 97.32 93.46 94.02
Slovak-SNK 119K 94.52 96.95 91.88 94.55
Ukrainian-IU 118K 93.63 96.80 91.24 93.68
Turkish-PUD 17K 78.20 89.19 86.71 91.28
Bulgarian-BTB 152K 95.98 97.58 97.16 97.83
Russian-PUD 19K 83.78 92.54 81.73 87.79
Belarusian-HSE 8K 78.06 89.87 69.39 71.95
Hindi-HDTB 322K 98.15 98.82 96.12 96.60
Czech-CAC 474K 98.01 98.86 96.52 97.54
Hungarian-Szeged 38K 87.89 95.26 89.63 91.65
Swedish-LinES 74K 93.52 96.82 93.25 94.88
Afrikaans-Af.B. 45K 93.75 98.74 95.08 95.96
English-LinES 77K 96.19 98.27 94.57 95.43

Acknowledgement

This work was carried out by Eray Yildiz and A. Cuneyd Tantug at Istanbul Technical University. For questions: yildiz17@itu.edu.tr

We would like to thank the SigMorphon 2019 organizers for their great effort and the reviewers for their insightful comments.

License: MIT License

Languages: Jupyter Notebook 90.5%, Python 9.5%