360er0 / COMBO

COMBO is a jointly trained tagger, lemmatizer and dependency parser.

COMBO

COMBO is a jointly trained neural tagger, lemmatizer and dependency parser implemented in Python 3 using the Keras framework. It took part in the CoNLL 2018 Universal Dependencies shared task and ranked 3rd/4th in the official evaluation.

Paper

COMBO is described in the paper "Semi-Supervised Neural System for Tagging, Parsing and Lematization".

Usage

Training your own model:

python main.py --mode autotrain --train train_data.conllu --valid valid_data.conllu --embed external_embedding.txt --model model_name.pkl --force_trees
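When training several models in a row, it can be convenient to assemble the call from Python rather than typing it by hand. The following is a minimal sketch, not part of COMBO itself: the helper names `build_train_command` and `run` are hypothetical, and the only flags used are the ones shown in the command above. It assumes the script is started from the directory that contains main.py.

```python
import subprocess
import sys


def build_train_command(train, valid, embed, model):
    """Return the argument list for the training call shown above.

    Hypothetical helper: it only mirrors the flags from the README
    and does not validate that the files exist.
    """
    return [
        sys.executable, "main.py",
        "--mode", "autotrain",
        "--train", train,
        "--valid", valid,
        "--embed", embed,
        "--model", model,
        "--force_trees",
    ]


def run(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

For example, `run(build_train_command("train_data.conllu", "valid_data.conllu", "external_embedding.txt", "model_name.pkl"))` would start the same training run as the command above.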

Making predictions:

python main.py --mode predict --test test_data.conllu --pred output_path.conllu --model model_name.pkl
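The prediction file is written in CoNLL-U, the tab-separated, ten-column format used by Universal Dependencies. As a rough sketch of how the output could be inspected afterwards, the reader below yields one list of token dictionaries per sentence. The function name `read_conllu` and the lowercase field names are illustrative choices, not part of COMBO. Comment lines, multiword-token ranges (IDs such as `1-2`) and empty nodes (IDs such as `1.1`) are skipped.

```python
from typing import Dict, Iterable, Iterator, List

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment, e.g. "# text = ...".
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            # Multiword-token range or empty node: not a syntactic word.
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # The file did not end with a blank line.
        yield sentence
```

Opening output_path.conllu with `encoding="utf-8"` and passing the file object to `read_conllu` gives access to the predicted `lemma`, `upos`, `head` and `deprel` of every token.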

Trained models

Models trained on UD datasets:

Language Treebank LAS MLAS BLEX Model size
Afrikaans af_afribooms 84.72 72.91 74.98 377 MB
Ancient Greek grc_perseus 74.20 53.30 54.29 101 MB
Ancient Greek grc_proiel 76.45 59.95 67.47 101 MB
Arabic ar_padt 71.95 62.75 64.38 737 MB
Armenian hy_armtdp 28.15 5.02 11.25 738 MB
Basque eu_bdt 83.12 68.82 77.96 737 MB
Bulgarian bg_btb 89.36 81.10 79.98 738 MB
Buryat bxr_bdt 15.16 1.09 1.92 90 MB
Catalan ca_ancora 90.54 83.11 85.20 737 MB
Chinese zh_gsd 63.92 53.48 57.84 744 MB
Croatian hr_set 86.32 71.12 79.74 737 MB
Czech cs_cac 90.72 83.27 86.69 740 MB
Czech cs_fictree 91.83 84.23 87.81 740 MB
Czech cs_pdt 90.34 84.04 86.96 740 MB
Danish da_ddt 83.43 74.22 77.58 737 MB
Dutch nl_alpino 87.15 74.93 77.06 737 MB
Dutch nl_lassysmall 84.27 72.65 75.44 737 MB
English en_ewt 82.31 73.33 76.52 737 MB
English en_gum 82.82 73.24 73.57 737 MB
English en_lines 80.33 72.25 74.01 737 MB
Estonian et_edt 83.46 75.79 72.07 738 MB
Finnish fi_ftb 86.89 78.42 81.06 739 MB
Finnish fi_tdt 85.93 78.65 72.39 739 MB
French fr_gsd 85.42 77.08 79.72 738 MB
French fr_sequoia 88.99 81.48 84.67 738 MB
French fr_spoken 74.31 63.43 65.34 738 MB
Galician gl_ctg 81.17 68.15 73.60 736 MB
Galician gl_treegal 73.21 52.88 62.86 736 MB
German de_gsd 77.43 54.28 68.59 738 MB
Gothic got_proiel 65.87 50.81 59.30 48 MB
Greek el_gdt 88.49 76.15 78.57 738 MB
Hebrew he_htb 63.69 50.26 53.58 737 MB
Hindi hi_hdtb 91.43 76.23 86.29 593 MB
Hungarian hu_szeged 79.47 66.09 72.51 737 MB
Indonesian id_gsd 78.40 67.30 75.10 737 MB
Irish ga_idt 69.24 37.31 47.32 206 MB
Italian it_isdt 91.03 83.18 84.76 737 MB
Italian it_postwita 73.99 61.14 62.98 737 MB
Japanese ja_gsd 73.69 57.82 60.62 743 MB
Kazakh kk_ktb 22.38 4.40 7.86 738 MB
Korean ko_gsd 80.66 74.49 66.13 741 MB
Korean ko_kaist 84.88 76.92 72.40 743 MB
Kurmanji kmr_mg 21.95 2.26 5.01 45 MB
Latin la_ittb 85.54 79.84 83.51 526 MB
Latin la_perseus 68.07 49.77 52.75 526 MB
Latin la_proiel 70.08 56.82 64.94 526 MB
Latvian lv_lvtb 80.71 66.22 71.80 637 MB
North Sámi sme_giella 57.16 39.66 45.03 47 MB
Norwegian no_bokmaal 89.33 79.51 84.68 737 MB
Norwegian no_nynorsk 88.36 79.32 82.89 737 MB
Norwegian no_nynorsklia 68.26 57.51 60.98 737 MB
Old Church Slavonic cu_proiel 71.14 56.52 66.04 48 MB
Old French fro_srcmf 84.81 76.75 81.20 52 MB
Persian fa_seraji 86.14 80.30 76.29 737 MB
Polish pl_lfg 94.62 86.44 89.31 737 MB
Polish pl_sz 91.38 80.45 85.59 737 MB
Polish poleval2018 86.11 76.18 79.86 115 MB
Portuguese pt_bosque 87.57 74.31 80.31 737 MB
Romanian ro_rrt 85.31 76.84 79.54 737 MB
Russian ru_syntagrus 91.10 85.37 87.16 741 MB
Russian ru_taiga 74.24 61.59 64.36 741 MB
Serbian sr_set 87.27 73.79 79.92 738 MB
Slovak sk_snk 83.76 63.97 75.34 54 MB
Slovenian sl_ssj 85.72 75.07 81.11 737 MB
Slovenian sl_sst 58.12 45.93 50.94 737 MB
Spanish es_ancora 89.68 82.60 84.51 737 MB
Swedish sv_lines 81.97 66.26 77.01 737 MB
Swedish sv_talbanken 85.89 77.68 80.74 737 MB
Turkish tr_imst 63.54 52.51 58.89 737 MB
Ukrainian uk_iu 84.71 69.88 77.97 738 MB
Upper Sorbian hsb_ufal 21.30 1.45 4.53 139 MB
Urdu ur_udtb 81.53 55.70 72.49 485 MB
Uyghur ug_udt 63.10 40.71 52.76 165 MB
Vietnamese vi_vtb 42.53 35.11 38.47 736 MB

License

CC BY-NC-SA 4.0

Citation

@InProceedings{rybak-wrblewska:2018:K18-2,
  author    = {Rybak, Piotr  and  Wr{\'{o}}blewska, Alina},
  title     = {Semi-Supervised Neural System for Tagging, Parsing and Lematization},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {45--54},
  url       = {http://www.aclweb.org/anthology/K18-2004}
}
