sunling120 / fine-tuned-berts-seq

🤗 + ⚛️ Fine-tuned Transformers compatible BERT models for Sequence Tagging

This repository contains fine-tuned Transformers compatible BERT models for sequence tagging tasks like NER or PoS tagging.
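
Once released, the models are meant to be used with the standard Transformers classes. The following is a minimal prediction sketch, assuming a recent Transformers version; the model identifier is a placeholder, since the models are not published yet (see the ToDo section below).

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification

# Placeholder identifier -- replace it with a released fine-tuned checkpoint
# once the models from this repository are published.
model_name = "path/to/fine-tuned-bert-ner"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

sentence = "George Washington went to Washington ."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs)[0]  # shape: (1, seq_len, num_labels)

predicted_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, label_id in zip(tokens, predicted_ids):
    print(token, model.config.id2label[label_id])
```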

We use the (coming soon) NER fine-tuning example from the awesome Transformers repository.

Changelog

  • 23.09.2019: PoS tagging results on German HDT added
  • 21.09.2019: Initial release of this repo. Publicly available models are coming soon!

CoNLL datasets

For NER tasks, the CoNLL 2002 and 2003 datasets are used. The following languages are covered: German, English, Dutch and Spanish.

These models are trained with a max. sequence length of 128.

❔ But what happens if a sentence in the dataset is longer than 128 subtokens? We simply split longer sentences into smaller ones. We do not ✂️ (truncate) sentences, which would be bad for evaluation!
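
A minimal sketch of this splitting idea is shown below: it keeps word/tag alignment and starts a new chunk whenever the subtoken budget would be exceeded. This is only an illustration, not necessarily the exact pre-processing script used for the reported results.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
MAX_SUBTOKENS = 128 - 2  # reserve two slots for [CLS] and [SEP]

def split_long_sentence(tokens, tags, max_subtokens=MAX_SUBTOKENS):
    """Split one (token, tag) sequence into chunks that fit the subtoken budget."""
    chunks, current, current_len = [], [], 0
    for token, tag in zip(tokens, tags):
        n_sub = len(tokenizer.tokenize(token)) or 1  # odd tokens still cost one slot
        if current and current_len + n_sub > max_subtokens:
            chunks.append(current)
            current, current_len = [], 0
        current.append((token, tag))
        current_len += n_sub
    if current:
        chunks.append(current)
    return chunks
```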

❔ What is the current state-of-the-art for fine-tuned models? We mainly compare our models to the following two papers: Pires, Schlinger and Garrette (2019) and Wu and Dredze (2019).

We use a max. sequence length of 128 and a batch size of 32. These are the same parameters as used by Pires, Schlinger and Garrette (2019).
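
To reproduce this setup with the Transformers NER example, an invocation could look like the sketch below. The flag names follow the 2019 examples/run_ner.py script and may have changed in later Transformers releases; all paths are placeholders.

```python
import subprocess

# Hyperparameters from this section; check `python run_ner.py --help` for the
# exact flag names in your Transformers version.
cmd = [
    "python", "run_ner.py",
    "--data_dir", "./data/conll2003-en",       # placeholder path
    "--model_type", "bert",
    "--model_name_or_path", "bert-base-cased",
    "--output_dir", "./output/conll2003-en",   # placeholder path
    "--max_seq_length", "128",
    "--per_gpu_train_batch_size", "32",
    "--seed", "1",                             # repeat with different seeds for the 5 runs
    "--do_train",
    "--do_eval",
    "--do_predict",
]
subprocess.run(cmd, check=True)
```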

Results

We report the F1-score averaged over 5 different runs, each with a different seed. The official CoNLL evaluation script from here is used.
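
For reference, the averages in the tables are plain means over the five runs, e.g. for the "BERT base, cased (Test)" row below:

```python
# Mean of the per-run F1-scores for "BERT base, cased (Test)".
runs = [90.89, 90.76, 90.82, 91.09, 91.60]
print(f"{sum(runs) / len(runs):.2f}")  # -> 91.03
```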

English

| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
| ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| BERT base, cased (Dev) | 95.13 | 95.29 | 95.07 | 95.12 | 95.53 | 95.23 |
| BERT base, cased (Test) | 90.89 | 90.76 | 90.82 | 91.09 | 91.60 | 91.03 |
| BERT large, cased (Dev) | 95.69 | 95.47 | 95.77 | 95.86 | 95.91 | 95.74 |
| BERT large, cased (Test) | 91.73 | 91.17 | 91.77 | 91.22 | 91.46 | 91.47 |

Pires, Schlinger and Garrette (2019) report an F1-score of 91.07% using the EN-BERT model.

German

Notice: For this experiment we use the original CoNLL-2003 data. There is also a 2006 update of the dataset that fixes various MISC annotations. However, it turns out that most papers do not use this updated 2006 version. So in this experiment we use the "old" dataset in order to allow a better comparison with other papers. A more detailed discussion of this dataset can be found here.

We use three BERT models:

  • Multilingual BERT (base and cased), further called mBERT
  • German BERT (base and cased) from here
  • Our own German BERT (base and cased), currently unreleased

| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
| ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| mBERT base, cased (Dev) | 86.04 | 85.64 | 86.04 | 85.10 | 83.16 | 85.20 |
| mBERT base, cased (Test) | 83.02 | 82.80 | 82.56 | 82.21 | 82.14 | 82.55 |
| German BERT base, cased (Dev) | 87.33 | 86.34 | 87.05 | 86.52 | 86.80 | 86.81 |
| German BERT base, cased (Test) | 83.84 | 83.98 | 83.62 | 83.70 | 83.37 | 83.70 |
| Own German BERT base, cased (Dev) | 86.66 | 86.75 | 87.06 | 86.61 | 87.22 | 86.86 |
| Own German BERT base, cased (Test) | 84.32 | 84.47 | 84.76 | 84.38 | 84.68 | 84.52 |
| German DistilBERT (v0), cased (Dev) | 86.58 | 86.43 | 86.19 | 86.60 | 86.44 | 86.45 |
| German DistilBERT (v0), cased (Test) | 83.33 | 83.11 | 83.30 | 83.55 | 84.18 | 83.49 |

Pires, Schlinger and Garrette (2019) report an F1-score of 82.00% for their fine-tuned multilingual BERT model, whereas Wu and Dredze (2019) report an F1-score of 82.82%.

For German DistilBERT a learning rate of 6e-5 was used.

Dutch

| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
| ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| mBERT base, cased (Dev) | 90.73 | 91.11 | 90.82 | 90.94 | 91.05 | 90.93 |
| mBERT base, cased (Test) | 90.94 | 90.29 | 90.10 | 90.32 | 90.27 | 90.38 |

Pires, Schlinger and Garrette (2019) report an F1-score of 89.86% for their fine-tuned multilingual BERT model, whereas Wu and Dredze (2019) report an F1-score of 90.94%.

Spanish

| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
| ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| mBERT base, cased (Dev) | 86.50 | 86.25 | 86.44 | 86.99 | 86.70 | 86.58 |
| mBERT base, cased (Test) | 87.80 | 87.93 | 87.92 | 87.00 | 87.53 | 87.64 |

Pires, Schlinger and Garrette (2019) report an F1-score of 87.18% for their fine-tuned multilingual BERT model, whereas Wu and Dredze (2019) report an F1-score of 87.38%.

GermEval 2014

We train three models for the GermEval 2014 shared task: mBERT, German BERT and our own German BERT model.

Notice: the original dataset includes some strange control character sequences (all labeled with the "O" tag). We remove them in a pre-processing step (for the train/dev/test sets).
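
A sketch of such a filtering step is shown below, assuming the GermEval 2014 TSV layout with the token in the second column; the exact rule used for the reported results may differ.

```python
import unicodedata

def clean_germeval_file(path_in, path_out):
    """Drop lines whose token consists only of Unicode control characters."""
    with open(path_in, encoding="utf-8") as f_in, \
         open(path_out, "w", encoding="utf-8") as f_out:
        for line in f_in:
            columns = line.rstrip("\n").split("\t")
            token = columns[1] if len(columns) > 1 else ""
            if token and all(unicodedata.category(ch) == "Cc" for ch in token):
                continue  # control-character token (labeled "O"), drop it
            f_out.write(line)
```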

Results

We report the F1-score averaged over 5 different runs (with different seeds). The official evaluation script is used, and we report the Strict, Combined Evaluation (official) metric here.

| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
| ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| mBERT base, cased (Dev) | 86.97 | 87.04 | 86.66 | 87.11 | 86.53 | 86.86 |
| mBERT base, cased (Test) | 85.90 | 86.37 | 86.47 | 86.56 | 86.00 | 86.26 |
| German BERT base, cased (Dev) | 87.36 | 87.03 | 87.55 | 87.53 | 87.23 | 87.34 |
| German BERT base, cased (Test) | 86.35 | 86.93 | 86.71 | 86.85 | 86.23 | 86.61 |
| Own German BERT base, cased (Dev) | 87.74 | 87.70 | 87.77 | 87.96 | 88.52 | 87.94 |
| Own German BERT base, cased (Test) | 86.96 | 86.85 | 87.01 | 86.89 | 86.73 | 86.89 |
| German DistilBERT (v0), cased (Dev) | 86.15 | 86.41 | 86.00 | 86.30 | 85.99 | 86.17 |
| German DistilBERT (v0), cased (Test) | 85.25 | 85.34 | 85.27 | 85.21 | 85.08 | 85.23 |

For German DistilBERT a learning rate of 6e-5 was used.

Universal Dependencies

German HDT

We train three models (mBERT, German BERT and our own German BERT model) on the recently released German HDT Universal Dependencies corpus. It contains over 200K annotated sentences, making it one of the largest UD corpora. We use the latest data from the dev branch of the German HDT repository.
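
The CoNLL-U files of the German HDT treebank can be turned into plain (token, UPOS tag) sequences for the tagger roughly as follows. This is a sketch of the conversion, not necessarily the exact script used for the results below.

```python
def read_conllu_pos(path):
    """Read a CoNLL-U file and yield (tokens, upos_tags) per sentence.

    Skips comment lines and multi-word/empty token rows (ids containing
    '-' or '.'), keeping only the FORM and UPOS columns.
    """
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if tokens:
                    yield tokens, tags
                    tokens, tags = [], []
                continue
            if line.startswith("#"):
                continue
            columns = line.split("\t")
            if "-" in columns[0] or "." in columns[0]:
                continue
            tokens.append(columns[1])  # FORM
            tags.append(columns[3])    # UPOS
    if tokens:
        yield tokens, tags
```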

Results

We report the accuracy averaged over 5 different runs (with different seeds).

| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
| ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| mBERT base, cased (Dev) | 98.36 | 98.35 | 98.36 | 98.36 | 98.36 | 98.35 |
| mBERT base, cased (Test) | 98.58 | 98.57 | 98.58 | 98.58 | 98.58 | 98.58 |
| German BERT base, cased (Dev) | 98.37 | 98.37 | 98.37 | 98.39 | 98.35 | 98.37 |
| German BERT base, cased (Test) | 98.57 | 98.55 | 98.57 | 98.54 | 98.57 | 98.56 |
| Own German BERT base, cased (Dev) | 98.38 | 98.38 | 98.39 | 98.37 | 98.36 | 98.38 |
| Own German BERT base, cased (Test) | 98.57 | 98.56 | 98.56 | 98.57 | 98.58 | 98.57 |

Italian

We've trained BERT models from scratch for Italian (both cased and uncased).

WikiNER

The Italian WikiNER dataset from here is pre-processed into an 80/10/10 (training, development, test) split. We train for 10 epochs using the default parameters from the run_ner.py script.
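
A split like that could be produced with a sketch such as the following; the exact shuffling and seed behind the reported results are not documented here, so treat the values as placeholders.

```python
import random

def split_sentences(sentences, seed=42):
    """Shuffle and split a list of sentences into 80/10/10 train/dev/test portions."""
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)  # seed is a placeholder value
    n_train = int(0.8 * len(sentences))
    n_dev = int(0.1 * len(sentences))
    train = sentences[:n_train]
    dev = sentences[n_train:n_train + n_dev]
    test = sentences[n_train + n_dev:]
    return train, dev, test
```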

Results

| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
| ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| mBERT base, cased (Dev) | 93.35 | 93.29 | 93.32 | 93.28 | 93.36 | 93.32 |
| mBERT base, cased (Test) | 93.39 | 93.65 | 93.63 | 93.44 | 93.54 | 93.53 |
| ItBERT base, cased (Dev) | 93.37 | 93.26 | 93.43 | 93.23 | 93.14 | 93.29 |
| ItBERT base, cased (Test) | 93.41 | 93.46 | 93.38 | 93.53 | 93.56 | 93.47 |
| ItBERT XXL base, cased (Dev) | 93.26 | 93.41 | 93.22 | 93.32 | 93.28 | 93.30 |
| ItBERT XXL base, cased (Test) | 93.47 | 93.66 | 93.66 | 93.65 | 93.62 | 93.61 |

EVALITA 2009

We use the NER data provided by the EVALITA 2009 shared task. We train for 20 epochs using the default parameters from the run_ner.py script.

Results

| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
| ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| mBERT base, cased (Test) | 84.93 | 85.52 | 85.33 | 85.16 | 84.98 | 85.18 |
| ItBERT base, cased (Test) | 85.88 | 85.96 | 85.27 | 85.60 | 86.08 | 85.76 |
| ItBERT XXL base, cased (Test) | 88.41 | 88.06 | 88.38 | 87.70 | 88.09 | 88.13 |

PoSTWITA

We use the dataset from the PoSTWITA shared task to report results for PoS tagging on Twitter data. The dataset is taken from this repository. We report results for both cased and uncased BERT models. We train for 20 epochs using the default parameters from the run_ner.py script.

| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
| ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| mBERT base, cased (Test) | 91.47 | 91.47 | 91.68 | 91.52 | 91.54 | 91.54 |
| ItBERT base, cased (Test) | 93.75 | 93.38 | 93.83 | 93.86 | 93.54 | 93.67 |
| ItBERT XXL base, cased (Test) | 93.79 | 93.82 | 93.56 | 93.78 | 93.79 | 93.75 |
| mBERT base, uncased (Test) | 91.75 | 91.97 | 91.56 | 91.98 | 91.97 | 91.85 |
| ItBERT base, uncased (Test) | 93.48 | 93.73 | 93.98 | 93.46 | 93.35 | 93.60 |
| ItBERT XXL base, uncased (Test) | 93.68 | 93.51 | 93.83 | 93.81 | 93.51 | 93.67 |

ToDo

  • Release models (incl. BertTokenizer and BertModel example usage)
  • Provide more details about training parameters and the prediction script
  • Release trained BERT model: German DistilBERT
  • Add link to Italian BERT models (cased and uncased)
