NorBERT

This repository contains in-house code used in training and evaluating NorBERT-1 and NorBERT-2: large-scale Transformer-based language models for Norwegian. The models were trained by the Language Technology Group at the University of Oslo. The computations were performed on resources provided by UNINETT Sigma2 - the National Infrastructure for High Performance Computing and Data Storage in Norway.

For most of the training, we used NVIDIA's BERT for TensorFlow implementation, with minor changes to their code; see the patches_for_NVIDIA_BERT subdirectory.

Training of the NorBERT models was conducted as part of the NorLM project. See this paper for more details:

Andrey Kutuzov, Jeremy Barnes, Erik Velldal, Lilja Øvrelid, Stephan Oepen. Large-Scale Contextualised Language Modelling for Norwegian. NoDaLiDa 2021.

NorBERT-3

In 2023, we released NorBERT-3, a new family of language models for Norwegian. We now generally recommend using these models over NorBERT-1 and NorBERT-2.

NorBERT-3 is described in detail in this paper: NorBench – A Benchmark for Norwegian Language Models (Samuel et al., NoDaLiDa 2023)
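As a minimal sketch, one way to try a NorBERT-3 model is through the Hugging Face transformers library. The Hub identifier ltg/norbert3-base and the need for trust_remote_code=True (the models use a custom architecture) are assumptions here; check the model cards on the Hugging Face Hub for the exact names, variants, and loading instructions.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumed Hub identifier; see the ltg organization on the Hugging Face Hub
# for the exact model names and available sizes.
MODEL_NAME = "ltg/norbert3-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# NorBERT-3 uses a custom architecture, so trusting remote code is assumed
# to be required when loading through the Auto classes.
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME, trust_remote_code=True)
model.eval()

# Fill in a masked token in a Norwegian sentence.
text = f"Universitetet i {tokenizer.mask_token} er Norges eldste universitet."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Assumes the custom model returns standard masked-LM logits.
    logits = model(**inputs).logits

# Pick the highest-scoring token at the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(top_id))
```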


About

Large-scale language models for Norwegian.

License: Creative Commons Zero v1.0 Universal

Languages

Python 77.0%, Shell 23.0%