
Swedish BERT Models

Update 2020-02-26: Huggingface BERT-base and NER are updated
Update 2020-04-27: Added some comparisons to NER performance of this and other BERTs
Update 2020-07-03: You can now cite us!
Update 2020-07-11: TF models available through Huggingface Transformers and comparison to BotXO model
Update 2020-10-15: Added POS and experimental QA models.
Update 2020-10-15: Added Electra-small generator and discriminator models.

The National Library of Sweden / KBLab releases pretrained language models based on BERT, ALBERT and ELECTRA. The models are trained on approximately 15-20 GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text. A more complete description is available in our paper Playing with Words at the National Library of Sweden - Making a Swedish BERT.

The following six models are currently available:

  • bert-base-swedish-cased (v1.1) - A BERT trained with the same hyperparameters as first published by Google.
  • bert-base-swedish-cased-ner (experimental) - A BERT fine-tuned for NER using SUC 3.0.
  • bert-base-swedish-cased-pos (experimental) - A BERT fine-tuned for POS tagging using SUC 3.0.
  • bert-base-swedish-cased-squad-experimental (experimental) - A first attempt at Swedish QA using a machine-translated SQuAD 2.0 dataset.
  • albert-base-swedish-cased-alpha (alpha) - A first attempt at an ALBERT for Swedish.
  • electra-small-swedish-cased - Small Electra models (generator and discriminator) trained for one million steps.

All models are cased and trained with whole word masking.

Files

name                                         files
bert-base-swedish-cased                      config, vocab, pytorch_model.bin, TF checkpoint
bert-base-swedish-cased-ner                  config, vocab, pytorch_model.bin
bert-base-swedish-cased-pos                  config, vocab, pytorch_model.bin
bert-base-swedish-cased-squad-experimental   config, vocab, pytorch_model.bin
albert-base-swedish-cased-alpha              config, sentencepiece model, pytorch_model.bin, TF checkpoint
electra-small-swedish-cased-generator        config, vocab, pytorch_model.bin, tf_model.h5
electra-small-swedish-cased-discriminator    config, vocab, pytorch_model.bin, tf_model.h5

Usage requirements / installation instructions

The examples below require Huggingface Transformers 2.4.1 and PyTorch 1.3.1 or greater. For Transformers < 2.4.0 the tokenizer must be instantiated manually, with the do_lower_case parameter set to False and keep_accents set to True (for ALBERT).
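
For older Transformers versions, such a manual instantiation could look like the following minimal sketch (do_lower_case and keep_accents are standard tokenizer arguments; the model identifiers are the ones listed above):

from transformers import BertTokenizer, AlbertTokenizer

# BERT: keep the original casing of the input text
bert_tok = BertTokenizer.from_pretrained('KB/bert-base-swedish-cased', do_lower_case=False)

# ALBERT: keep casing and accented characters such as Å, Ä and Ö
albert_tok = AlbertTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha', do_lower_case=False, keep_accents=True)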

To create an environment where the examples can be run, run the following in a terminal on your OS of choice.

# git clone https://github.com/Kungbib/swedish-bert-models
# cd swedish-bert-models
# python3 -m venv venv
# source venv/bin/activate
# pip install --upgrade pip
# pip install -r requirements.txt

On some platforms, notably macOS < 10.15, you may have to install a Rust compiler for the Transformers installation to succeed.
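
As a quick sanity check that the environment matches the versions required above, the following can be run inside the activated virtualenv (it simply prints the installed versions):

# python -c "import transformers, torch; print(transformers.__version__, torch.__version__)"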

BERT Base Swedish

UPDATE: for Transformers==2.5.0 add the parameter use_fast=False to AutoTokenizer.from_pretrained(...) to retain accented characters such as Å, Ä and Ö.
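
For example, a minimal sketch of where the parameter goes:

from transformers import AutoTokenizer

# use the slow tokenizer so that Å, Ä and Ö are kept intact
tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased', use_fast=False)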

A standard BERT base for Swedish trained on a variety of sources. Vocabulary size is ~50k. Using Huggingface Transformers the model can be loaded in Python as follows:

from transformers import AutoModel, AutoTokenizer, TFAutoModel

tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')

# Using TF models
model = TFAutoModel.from_pretrained('KB/bert-base-swedish-cased')
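
Once loaded, the model can be applied to Swedish text in the usual way. The sketch below (assuming the tok and model objects from above, and an arbitrary example sentence) encodes a sentence and inspects the resulting hidden states:

import torch

# encode a Swedish sentence and run it through the model
input_ids = tok.encode('Biblioteket ligger i Stockholm.', return_tensors='pt')
with torch.no_grad():
    outputs = model(input_ids)

# the first element of the output holds the last hidden states, one vector per token
print(outputs[0].shape)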

BERT base fine-tuned for Swedish NER

This model is fine-tuned on the SUC 3.0 dataset. A preliminary evaluation (F1) compared to other BERT models is as follows:

model        PER    ORG    LOC    TME    MSR    WRK    EVN    OBJ    AVG
AF-AI        0.913  0.780  0.913  0.655  0.828  0.596  0.716  0.710  0.898
BotXO        --     --     --     --     --     --     --     --     0.899
BERT Multi   0.945  0.834  0.942  0.888  0.853  0.631  0.792  0.761  0.906
KB-BERT      0.961  0.884  0.958  0.906  0.890  0.720  0.834  0.770  0.928

Using the Huggingface pipeline, the model can be easily instantiated. For Transformers < 2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings:

from transformers import pipeline, TFBertForTokenClassification

nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')

nlp('Kalle och Pelle startar firman Kalle och Pelle.')

# Specifically using TensorFlow

tf = TFBertForTokenClassification.from_pretrained('KB/bert-base-swedish-cased-ner')
nlp = pipeline('ner', model=tf, tokenizer='KB/bert-base-swedish-cased-ner')

Running the Python code above should produce something like the result below. Note that the model disambiguates between the names of the persons and the name of the company.

[ { 'word': 'Kalle', 'score': 0.9998126029968262, 'entity': 'PER' },
  { 'word': 'Pelle', 'score': 0.9998126029968262, 'entity': 'PER' },
  { 'word': 'Kalle', 'score': 0.9814832210540771, 'entity': 'ORG' },
  { 'word': 'och',   'score': 0.9814832210540771, 'entity': 'ORG' },
  { 'word': 'Pelle', 'score': 0.9814832210540771, 'entity': 'ORG' } ]

Entity types used are TME for time, PRS for personal names, LOC for locations, EVN for events and ORG for organisations. These labels are subject to change.

The BERT tokenizer often splits words into multiple tokens, with the subparts starting with ##. For example, the string Engelbert kör Volvo till Herrängens fotbollsklubb gets tokenized as Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb. To glue the parts back together, one can use something like this:

text = 'Engelbert tar sin Rolls-Royce till Tele2 Arena för att titta på Djurgården IF ' +\
       'som spelar fotboll i VM klockan två på kvällen.'

nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner', ignore_labels=[])
l = []
t = nlp(text)
in_word = False

for i, token in enumerate(t):
    # tokens labelled 'O' are outside any entity and end the current word
    if token['entity'] == 'O':
        in_word = False
        continue

    if token['word'].startswith('##'):
        # deal with (one level of) orphaned ##-tokens
        if not in_word:
            l += [ t[i-1] ]
            l[-1]['entity'] = token['entity']

        # glue the ##-suffix onto the previous word
        l[-1]['word'] += token['word'][2:]
    else:
        l += [ token ]

    in_word = True

print(l)

This should result in the following (though less cleanly formatted):

[ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
  { 'word': 'Rolls',         'score': 0.99..., 'entity': 'OBJ'},
  { 'word': '-',             'score': 0.99..., 'entity': 'OBJ'},
  { 'word': 'Royce',         'score': 0.99..., 'entity': 'OBJ'},
  { 'word': 'Tele2',         'score': 0.99..., 'entity': 'LOC'},
  { 'word': 'Arena',         'score': 0.99..., 'entity': 'LOC'},
  { 'word': 'Djurgården',    'score': 0.99..., 'entity': 'ORG'},
  { 'word': 'IF',            'score': 0.99..., 'entity': 'ORG'},
  { 'word': 'VM',            'score': 0.99..., 'entity': 'EVN'},
  { 'word': 'klockan',       'score': 0.99..., 'entity': 'TME'},
  { 'word': 'två',           'score': 0.99..., 'entity': 'TME'},
  { 'word': 'på',            'score': 0.99..., 'entity': 'TME'},
  { 'word': 'kvällen',       'score': 0.54..., 'entity': 'TME'} ]

ALBERT base

The easiest way to use the ALBERT model is, again, via Huggingface Transformers:

from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha')
model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
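
As a quick check that casing and accented characters survive tokenization (a minimal sketch; the example sentence is arbitrary):

# tokenize a sentence containing Å and Ö; the subword pieces should keep their accents
print(tok.tokenize('Åker du till Örebro på lördag?'))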

Acknowledgements ❤️

  • Resources from Stockholm University, Umeå University and the Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
  • Model pretraining was done partly in-house at KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
  • Models are hosted on S3 by Huggingface 🤗

Citation in BibTeX format

If you wish to reference this work, please use the following:

@misc{swedish-bert,
    title={Playing with Words at the National Library of Sweden -- Making a Swedish BERT},
    author={Martin Malmsten and Love Börjeson and Chris Haffenden},
    year={2020},
    eprint={2007.01658},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
