SlavicaJ / babushka-bench

Benchmarking NLP tools on Slovene, Croatian and Serbian

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Benchmarking NLP tools on Slovene, Croatian, Serbian and Bulgarian

For now, the following processing levels are present in the repo:

Segmentation

Tokens

tool revision parameters dataset language P R F1
reldi-tokeniser fb85138 -l sl ssj500k sl 99.68 99.18 99.43
Obeliks4J 32266e7 ssj500k default sl 99.98 99.98 99.98
reldi-tokeniser fb85138 -l hr hr500k hr 99.57 99.55 99.56
reldi-tokeniser fb85138 -l sr SETimes.SR sr 99.92 99.97 99.94

Words

Will come later when tagging is included?

Sentences

tool revision parameters dataset language P R F1
reldi-tokeniser fb85138 -l sl ssj500k sl 97.85 96.49 97.17
Obeliks4J 32266e7 default ssj500k sl 99.09 99.26 99.18
reldi-tokeniser fb85138 -l hr hr500k hr 90.64 93.45 92.02
reldi-tokeniser fb85138 -l sr SETimes.SR sr 97.45 95.92 96.68

Morphosyntactic tagging

tool revision comment segmentation dataset language P R F1
reldi-tagger 994f746 gold ssj500k sl 94.21 94.21 94.21
Obeliks gold ssj500k  sl 92.67 92.67 92.67
meta-tagger gold ssj500k sl 94.34 94.34 94.34
Parser-v3 9ee9e8f CLARIN.SI FT embeddings gold ssj500k sl 96.58 96.58 96.58
Parser-v3 9ee9e8f CLARIN.SI FT embeddings Obeliks4J ssj500k sl 96.56 96.55 96.56
Parser-v3 9ee9e8f CLARIN.SI FT embeddings reldi-tokeniser ssj500k sl 96.39 96.35 96.37
stanfordnlp 828ef2e CoNLL17 embeddings gold ssj500k sl 96.45 96.45 96.45
stanfordnlp 828ef2e CLARIN.SI FT embeddings gold ssj500k sl 96.72 96.72 96.72
stanfordnlp 828ef2e CLARIN.SI W2V embeddings gold ssj500k sl 96.79 96.79 96.79
stanfordnlp 828ef2e CLARIN.SI FT embeddings gold ssj500k_ud sl 95.65 95.65 95.65
classla-stanfordnlp 2c41295 CLARIN.SI FT embeddings  gold ssj500k sl 97.06 97.06 97.06
reldi-tagger 994f746 gold hr500k hr 91.91 91.91 91.91
Parser-v3 9ee9e8f CLARIN.SI FT embeddings gold hr500k hr 94.29 94.29 94.29
Parser-v3 9ee9e8f CLARIN.SI FT embeddings reldi-tokeniser hr500k hr 93.89 93.86 93.87
stanfordnlp 828ef2e CoNLL17 embeddings gold hr500k hr 93.85 93.85 93.85
stanfordnlp 828ef2e CLARIN.SI FT embeddings gold hr500k hr 94.13 94.13 94.13
stanfordnlp 828ef2e CLARIN.SI W2V embeddings gold hr500k hr 94.18 94.18 94.18
stanfordnlp 828ef2e CLARIN.SI FT embeddings gold hr500k_ud hr 94.60 94.60 94.60
reldi-tagger 994f746 gold SETimes.SR sr 92.03 92.03 92.03
Parser-v3 9ee9e8f CLARIN.SI FT embeddings gold SETimes.SR sr 95.12 95.12 95.12
Parser-v3 9ee9e8f CLARIN.SI FT embeddings reldi-tokeniser SETimes.SR sr 95.07 95.12 95.10
stanfordnlp 828ef2e CoNLL17 (Croatian) embeddings gold SETimes.SR sr 94.78 94.78 94.78
stanfordnlp 828ef2e CLARIN.SI FT (Croatian) embeddings gold SETimes.SR sr 94.69 94.69 94.69
stanfordnlp 828ef2e CLARIN.SI FT (Serbian) embeddings gold SETimes.SR sr 95.23 95.23 95.23
stanfordnlp  828ef2e CLARIN.SI W2V (Serbian) embeddings gold SETimes.SR sr 94.91 94.91 94.91
classla-stanfordnlp 2c41295 CoNLL17 embeddings gold BTB bg 96.77 96.77 96.77

Lemmatization

tool revision comment preprocessing dataset language P R F1
reldi-tagger 994f746 gold ssj500k sl 99.46 99.46 99.46
reldi-tagger 994f746 gold segmentation, reldi-tagger ssj500k sl 98.35 98.35 98.35
reldi-tagger 994f746 gold segmentation, stanfordnlp ssj500k sl 98.77 98.77 98.77
Obeliks gold segmentation, Obeliks ssj500k sl 98.19 98.19 98.19
meta-tagger gold segmentation, meta-tagger ssj500k sl 98.66 98.66 98.66
stanfordnlp 828ef2e gold ssj500k sl 97.75 97.75 97.75
stanfordnlp 828ef2e gold segmentation, stanfordnlp ssj500k sl 97.51 97.51 97.51
classla-stanfordnlp gold ssj500k sl 99.63 99.63 99.63
classla-stanfordnlp gold segmentation, stanfordnlp ssj500k sl 99.02 99.02 99.02
reldi-tagger 994f746 gold hr500k hr 98.17 98.17 98.17
reldi-tagger 994f746 gold segmentaton, reldi-tagger hr500k hr 96.82 96.82 96.82
reldi-tagger 994f746 gold segmentation, stanfordnlp hr500k hr 97.22 97.22 97.22
stanfordnlp 828ef2e gold hr500k hr 96.22 96.22 96.22
stanfordnlp 828ef2e gold segmentation, stanfordnlp hr500k hr 95.85 95.85 95.85
classla-stanfordnlp 56c7241 gold hr500k hr 98.57 98.57 98.57
classla-stanfordnlp 56c7241 gold segmentation, stanfordnlp hr500k hr 97.60 97.60 97.60
reldi-tagger 994f746 gold SETimes.SR sr 97.89 97.89 97.89
reldi-tagger 994f746 gold segmentation, reldi-tagger SETimes.SR sr 96.44 96.44 96.44
reldi-tagger 994f746 gold segmentation, stanfordnlp SETimes.SR sr 97.26 97.26 97.26
stanfordnlp 828ef2e gold SETimes.SR sr 95.29 95.29 95.29
stanfordnlp 828ef2e gold segmentation, stanfordnlp SETimes.SR sr 95.18 95.18 95.18
classla-stanfordnlp 56c7241 gold SETimes.SR sr 98.49 98.49 98.49
classla-stanfordnlp 56c7241 gold segmentation, stanfordnlp SETimes.SR sr 97.89 97.89 97.89
classla-stanfordnlp 2c41295 gold segmentation, classla-stanfordnlp BTB bg 98.80 98.80 98.80

Parsing

tool revision comment preprocessing dataset language P R F1
classla-stanfordnlp 56c7241 gold segmentation, classla-stanfordnlp ssj500k sl 92.68 92.68 92.68
classla-stanfordnlp 56c7241 gold ssj500k sl 94.19 94.19 94.19
classla-stanfordnlp 56c7241 gold segmentation, classla-stanfordnlp hr500k hr 85.86 85.86 85.86
classla-stanfordnlp  56c7241 gold hr500k hr 86.64 86.64  86.64
classla-stanfordnlp 56c7241 gold segmentation, classla-stanfordnlp SETimes.SR sr 88.96 88.96 88.96
classla-stanfordnlp  56c7241 gold SETimes.SR sr 90.20 90.20  90.20
classla-stanfordnlp 2c41295 gold segmentation, classla-stanfordnlp BTB-UD bg 91.45 91.45 91.45

Named entity recognition

For named entity recognition, macro-F1 and accuracy are calculated on the token level, disregarding the B-/I- label prefixes.

tool revision comment preprocessing dataset language macro-F1 accuracy
janes-ner cf687e8 gold segmentation and tagging ssj500k sl 0.673 0.984
janes-ner cf687e8 gold segmentation and tagging hr500k hr 0.752 0.978
janes-ner cf687e8 gold segmentation and tagging SETimes.SR sr 0.781 0.975
simpletransformers ver 0.7.10 bert-base-multilingual-cased, 3 epochs, other default gold segmentation ssj500k sl 0.868 0.991
simpletransformers ver 0.7.10 bert-base-multilingual-cased, 3 epochs, other default gold segmentation hr500k hr 0.886 0.988
simpletransformers ver 0.7.10 bert-base-multilingual-cased, 3 epochs, other default gold segmentation SETimes.SR sr 0.911 0.989

About

Benchmarking NLP tools on Slovene, Croatian and Serbian


Languages

Language:Python 100.0%