A curated list of resources for natural language processing (NLP) in Swedish

Awesome Resources for NLP/NLG in Swedish

For those of you who are not used to long Markdown files: GitHub automatically generates a table of contents for you! See more info on how to find it here.

Corpora

Corpus = a collection of raw unlabeled texts

Monolingual

Free and Downloadable

  • Språkbanken Text -- a hub page for many Swedish corpora maintained by Språkbanken Text. The monolingual corpora come from newspapers, blog posts, and literature from different periods (some from as early as the 18th century). Note that many of these corpora contain scrambled sentences.
  • CC-100 -- documents extracted from Common Crawl, automatically classified and filtered. The Swedish part is 21 GB of raw text.
  • mC4 -- a colossal, cleaned version of Common Crawl's web crawl corpus (C4); the Swedish part contains about 65 GB of raw text.
  • SOU corpus -- cleaned and further processed versions of the Swedish Government Official Reports (Statens offentliga utredningar, SOU), covering the reports between 1994 and 2020.
  • SweDraCor -- a corpus of 68 TEI-encoded Swedish-language plays taken from the eDrama project.
  • Swedish Poetry -- poetry corpus
  • LBPF -- Swedish prose fiction with modern spelling from Litteraturbanken
  • SBS -- a collection of sentences from Swedish blog posts published between November 2010 and September 2012; contains scrambled sentences. NOTE: links seem to be broken as of 2022-05-25.
  • Project Runeberg -- copyright-free Swedish literature
  • Swedish Diachronic Corpus -- text corpora covering the period from Old Swedish to the present day, across various text genres.

Free and Available by Request

Parallel

  • OPUS -- The Open Parallel Corpus, a hub for parallel datasets for many language pairs, including to/from Swedish; see the reading sketch after this list.
  • Språkbanken Text -- a hub page for many Swedish corpora maintained by Språkbanken Text. The available parallel corpora are EuroParl (the Swedish part of the European Parliament Proceedings Parallel Corpus) and ASPAC (the Swedish part of The Amsterdam Slavic Parallel Aligned Corpus). Note that both corpora contain scrambled sentences.
  • SMULTRON -- a parallel treebank that contains around 1000 sentences in English, German and Swedish
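
Many OPUS corpora can also be downloaded in the Moses format: two plain-text files, one sentence per line, aligned by line number. A minimal reading sketch for such a pair (the file names corpus.sv and corpus.en are hypothetical placeholders for a downloaded pair):

```python
# Read a sentence-aligned Moses-format pair downloaded from OPUS.
# "corpus.sv" / "corpus.en" are placeholder names for the two aligned files.
with open("corpus.sv", encoding="utf-8") as sv_file, \
     open("corpus.en", encoding="utf-8") as en_file:
    for sv_line, en_line in zip(sv_file, en_file):
        print(sv_line.strip(), "->", en_line.strip())
```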

Datasets

Dataset = a collection of labeled texts

Monolingual

Free and Downloadable

Swedish-first
  • Swedish Universal Dependencies treebanks -- can be used to train PoS taggers, lemmatizers, and dependency parsers (a loading sketch follows this list)
    • Talbanken: 96K tokens, 6K sentences
    • LinES: 90K tokens, 5K sentences
    • PUD: 19K tokens, 1K sentences
  • SweQUAD-MC -- a multiple choice question answering dataset
  • Swedish-sentiment -- a sentiment analysis dataset of 10,000 texts with a roughly 50/50 split between positive and negative sentiment
  • Swedish-Causality-Datasets -- two datasets, for causality recognition and causality ranking, with texts taken from the official reports of the Swedish Government
  • Swedish-MWE-dataset -- a multiword expression dataset, containing 96 Swedish expressions annotated for their degrees of compositionality
  • Swedish-NER
    • by Andreas Klintberg -- a semi-manually annotated version of the Webbnyheter 2012 corpus from Språkbanken, with 4 types of named entities: person, organization, location, miscellaneous.
    • by Robert Lorentz
    • The Written Works Corpus -- named entities for written works: ART, BOOK, GAME, MOVIE, MUSIC, PLAY, RADIO and TV. A more detailed description of the corpus is available here
  • SIC -- a corpus of Swedish Internet texts, manually annotated with part-of-speech tags and named entities
  • SUSC -- a corpus of seven novels by August Strindberg, annotated with part-of-speech tags, morphological analysis, and lemmas
  • SNEC -- The Strindberg National Edition Corpus, available both as a plain-text version and as a linguistically annotated CoNLL-U version -- NOTE: links seem to be broken as of 2022-05-25
  • SuperLim -- a Swedish counterpart of the GLUE benchmark
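
A minimal loading sketch for the Universal Dependencies treebanks listed above, assuming the conllu package (pip install conllu) and a locally downloaded treebank file (the file name below is a hypothetical placeholder):

```python
# Iterate over a Universal Dependencies treebank in CoNLL-U format.
# "sv_talbanken-ud-train.conllu" is a placeholder for a locally downloaded file.
from conllu import parse_incr

with open("sv_talbanken-ud-train.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        for token in sentence:
            # Each token carries the annotations used to train taggers and parsers.
            print(token["form"], token["lemma"], token["upos"],
                  token["head"], token["deprel"])
        break  # only the first sentence, for illustration
```
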
Translated
  • OverLim -- contains some of the GLUE and SuperGLUE tasks automatically translated to Swedish, Danish, and Norwegian (Bokmål) using the OPUS-MT models for MarianMT; the translation quality was not manually checked (see the translation sketch after this list)
  • XNLI -- an automatically translated (Google Translate) natural language inference (NLI) dataset; no information about human correction
  • STS Benchmark -- a semantic textual similarity (STS) dataset, an automatic translation of the original English STS Benchmark produced with Google's NMT API without human correction
  • SwedSQuAD -- a machine-translated version of SQuAD (the Stanford Question Answering Dataset); no information about human correction
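
For reference, translations of this kind can be produced with OPUS-MT checkpoints via MarianMT. A minimal sketch, assuming the transformers and sentencepiece packages are installed (Helsinki-NLP/opus-mt-en-sv is a published OPUS-MT English-to-Swedish checkpoint):

```python
# Translate English text to Swedish with an OPUS-MT checkpoint (MarianMT).
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-sv"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tokenizer(["The movie was surprisingly good."],
                  return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```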

Free and Available By Request

  • SUC 2.0 -- annotated with part-of-speech tags, morphological analysis, and lemmas (all of which can be considered gold-standard data), as well as some structural and functional information
  • SUC 3.0 -- an improved and extended version of SUC 2.0

Pre-trained resources

Word embeddings

Swedish-specific Transformer models

The code for calculating the number of parameters (it comes from this thread):

  • PyTorch: sum(p.numel() for p in model.parameters() if p.requires_grad)
  • TensorFlow: np.sum([np.prod(v.shape) for v in tf.trainable_variables])
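
For instance, a runnable end-to-end version of the PyTorch variant; a minimal sketch, assuming the transformers and torch packages are installed (the model id KB/bert-base-swedish-cased corresponds to the first model in the list below):

```python
# Count the trainable parameters of a pre-trained Hugging Face model.
from transformers import AutoModel

model = AutoModel.from_pretrained("KB/bert-base-swedish-cased")
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"~{n_params / 1e6:.0f}M trainable parameters")  # ~124M for this model
```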

And now to the models themselves; the snippets above were used to estimate the number of parameters.

  • BERT* models from The National Library of Sweden/KBLab
    • bert-base-swedish-cased: 12 layers, 768 hidden size, 12 heads, ~124M parameters
    • albert-base-swedish-cased-alpha: 12 layers, 768 hidden size, 12 heads, ~14M parameters
    • electra-small-swedish-cased
      • generator: 12 layers, 256 hidden size, 4 heads, ~16M parameters
      • discriminator: 12 layers, 256 hidden size, 4 heads, ~16M parameters
  • BERT models from Arbetsförmedlingen (The Swedish Public Employment Service)
    • bert-base-swedish-uncased: 12 layers, 768 hidden size, 12 heads, ~110M parameters
    • bert-large-swedish-uncased: 24 layers, 1024 hidden size, 16 heads, ~335M parameters
  • RoBERTa models
  • GPT-2 models
  • GPT-SW3 model (3.5B parameters): model on HF Hub -- NOTE: The repository is empty as of 2022-08-23
  • T5 models

Nordic Transformer models

Multilingual Transformer models

  • mBERT -- multilingual BERT by Google Research
  • mBART50 -- multilingual BART by FAIR
  • mT5 -- multilingual T5 by Google Research
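
All three are available on the Hugging Face Hub and can be tried on Swedish input directly. A minimal sketch, assuming the transformers package is installed (bert-base-multilingual-cased is mBERT's Hub id):

```python
# Probe multilingual BERT on a Swedish sentence with a fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
for prediction in fill_mask("Stockholm är huvudstaden i [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```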

Dependency parsing models

Part of speech taggers

Machine Translation models to/from Swedish

Tools

  • Granska -- a grammar checking tool for Swedish
  • Stava -- a spell checking tool for Swedish

Other resources


License: Creative Commons Zero v1.0 Universal