A curated list of resources for natural language processing (NLP) in Swedish

Awesome Resources for NLP/NLG in Swedish

For those of you who are not used to long Markdown files: GitHub automatically generates a table of contents for you! See more info on how to find it here.

Corpora

Corpus = a collection of raw unlabeled texts

Monolingual

Free and Downloadable

  • Språkbanken Text -- a hub page for many Swedish corpora maintained by Språkbanken Text. The monolingual corpora come from newspapers, blog posts, and literature from different periods (some from as early as the 18th century). Note that many of these corpora contain scrambled sentences.
  • CC-100 -- documents extracted from Common Crawl, automatically classified and filtered. The Swedish part is 21 GB of raw text.
  • mC4 -- a colossal, cleaned version of Common Crawl's web crawl corpus (C4); the Swedish part contains about 65 GB of raw text.
  • SOU corpus -- cleaned and further processed versions of the Swedish Government Official Reports (Statens offentliga utredningar, SOU), covering the reports between 1994 and 2020.
  • SweDraCor -- a corpus of 68 TEI-encoded Swedish-language plays taken from the eDrama project.
  • Swedish Poetry -- poetry corpus
  • LBPF -- Swedish prose fiction with modern spelling from Litteraturbanken
  • SBS -- a collection of sentences from Swedish blog posts published between November 2010 and September 2012; contains scrambled sentences. NOTE: links seem to be broken as of 2022-05-25.
  • Project Runeberg -- copyright-free Swedish literature
  • Swedish Diachronic Corpus -- text corpora covering the period from Old Swedish to the present day, across various text genres.

Free and Available by Request

Parallel

  • OPUS -- The Open Parallel Corpus, a hub for parallel datasets for many language pairs, including to/from Swedish; see the reading sketch after this list.
  • Språkbanken Text -- a hub page for many Swedish corpora maintained by Språkbanken Text. The available parallel corpora are EuroParl (the Swedish part of the European Parliament Proceedings Parallel Corpus) and ASPAC (the Swedish part of The Amsterdam Slavic Parallel Aligned Corpus). Note that both corpora contain scrambled sentences.
  • SMULTRON -- a parallel treebank that contains around 1000 sentences in English, German and Swedish
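
Many OPUS corpora can also be downloaded in the Moses format: two plain-text files, one sentence per line, aligned by line number. A minimal reading sketch for such a pair (the file names corpus.sv and corpus.en are hypothetical placeholders for a downloaded pair):

```python
# Read a sentence-aligned Moses-format pair downloaded from OPUS.
# "corpus.sv" / "corpus.en" are placeholder names for the two aligned files.
with open("corpus.sv", encoding="utf-8") as sv_file, \
     open("corpus.en", encoding="utf-8") as en_file:
    for sv_line, en_line in zip(sv_file, en_file):
        print(sv_line.strip(), "->", en_line.strip())
```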

Datasets

Dataset = a collection of labeled texts

Monolingual

Free and Downloadable

Swedish-first
  • Swedish Universal Dependencies treebanks -- can be used to train PoS taggers, lemmatizers, and dependency parsers (a loading sketch follows this list)
    • Talbanken: 96K tokens, 6K sentences
    • LinES: 90K tokens, 5K sentences
    • PUD: 19K tokens, 1K sentences
  • SweQUAD-MC -- a multiple choice question answering dataset
  • Swedish-sentiment -- a sentiment analysis dataset of 10,000 texts with a roughly 50/50 split between positive and negative sentiment
  • Swedish-Causality-Datasets -- two datasets, for causality recognition and causality ranking, with texts taken from the official reports of the Swedish Government
  • Swedish-MWE-dataset -- a multiword expression dataset, containing 96 Swedish expressions annotated for their degrees of compositionality
  • Swedish-NER
    • by Andreas Klintberg -- a semi-manually annotated version of the Webbnyheter 2012 corpus from Språkbanken, with 4 types of named entities: person, organization, location, miscellaneous.
    • by Robert Lorentz
    • The Written Works Corpus -- named entities for written works: ART, BOOK, GAME, MOVIE, MUSIC, PLAY, RADIO and TV. A more detailed description of the corpus is available here
  • SIC -- a corpus of Swedish Internet texts, manually annotated with part-of-speech tags and named entities
  • SUSC -- a corpus of seven novels by August Strindberg, annotated with part-of-speech tags, morphological analysis, and lemmas
  • SNEC -- The Strindberg National Edition Corpus, available both as a plain-text version and as a linguistically annotated CoNLL-U version -- NOTE: links seem to be broken as of 2022-05-25
  • SuperLim -- a Swedish counterpart of the GLUE benchmark
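
A minimal loading sketch for the Universal Dependencies treebanks listed above, assuming the conllu package (pip install conllu) and a locally downloaded treebank file (the file name below is a hypothetical placeholder):

```python
# Iterate over a Universal Dependencies treebank in CoNLL-U format.
# "sv_talbanken-ud-train.conllu" is a placeholder for a locally downloaded file.
from conllu import parse_incr

with open("sv_talbanken-ud-train.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        for token in sentence:
            # Each token carries the annotations used to train taggers and parsers.
            print(token["form"], token["lemma"], token["upos"],
                  token["head"], token["deprel"])
        break  # only the first sentence, for illustration
```
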
Translated
  • OverLim -- contains some of the GLUE and SuperGLUE tasks automatically translated to Swedish, Danish, and Norwegian (Bokmål) using the OPUS-MT models for MarianMT; the translation quality was not manually checked (see the translation sketch after this list)
  • XNLI -- an automatically translated (Google Translate) natural language inference (NLI) dataset; no information about human correction
  • STS Benchmark -- a semantic textual similarity (STS) dataset, an automatic translation of the original English STS Benchmark produced with Google's NMT API without human correction
  • SwedSQuAD -- a machine-translated version of SQuAD (the Stanford Question Answering Dataset); no information about human correction
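
For reference, translations of this kind can be produced with OPUS-MT checkpoints via MarianMT. A minimal sketch, assuming the transformers and sentencepiece packages are installed (Helsinki-NLP/opus-mt-en-sv is a published OPUS-MT English-to-Swedish checkpoint):

```python
# Translate English text to Swedish with an OPUS-MT checkpoint (MarianMT).
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-sv"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tokenizer(["The movie was surprisingly good."],
                  return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```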

Free and Available By Request

  • SUC 2.0 -- annotated with part-of-speech tags, morphological analysis, and lemmas (all of which can be considered gold-standard data), as well as some structural and functional information
  • SUC 3.0 -- an improved and extended version of SUC 2.0

Pre-trained resources

Word embeddings

Swedish-specific Transformer models

The code for calculating the number of parameters (it comes from this thread):

  • PyTorch: sum(p.numel() for p in model.parameters() if p.requires_grad)
  • TensorFlow: np.sum([np.prod(v.shape) for v in tf.trainable_variables])
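
For instance, a runnable end-to-end version of the PyTorch variant; a minimal sketch, assuming the transformers and torch packages are installed (the model id KB/bert-base-swedish-cased corresponds to the first model in the list below):

```python
# Count the trainable parameters of a pre-trained Hugging Face model.
from transformers import AutoModel

model = AutoModel.from_pretrained("KB/bert-base-swedish-cased")
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"~{n_params / 1e6:.0f}M trainable parameters")  # ~124M for this model
```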

And now to the models themselves; the snippets above were used to estimate the number of parameters.

  • BERT* models from The National Library of Sweden/KBLab
    • bert-base-swedish-cased: 12 layers, 768 hidden size, 12 heads, ~124M parameters
    • albert-base-swedish-cased-alpha: 12 layers, 768 hidden size, 12 heads, ~14M parameters
    • electra-small-swedish-cased
      • generator: 12 layers, 256 hidden size, 4 heads, ~16M parameters
      • discriminator: 12 layers, 256 hidden size, 4 heads, ~16M parameters
  • BERT models from Arbetsförmedlingen (The Swedish Public Employment Service)
    • bert-base-swedish-uncased: 12 layers, 768 hidden size, 12 heads, ~110M parameters
    • bert-large-swedish-uncased: 24 layers, 1024 hidden size, 16 heads, ~335M parameters
  • RoBERTa models
  • GPT-2 models
  • GPT-SW3 model (3.5B parameters): model on HF Hub -- NOTE: The repository is empty as of 2022-08-23
  • T5 models

Nordic Transformer models

Multilingual Transformer models

  • mBERT -- multilingual BERT by Google Research
  • mBART50 -- multilingual BART by FAIR
  • mT5 -- multilingual T5 by Google Research
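
All three are available on the Hugging Face Hub and can be tried on Swedish input directly. A minimal sketch, assuming the transformers package is installed (bert-base-multilingual-cased is mBERT's Hub id):

```python
# Probe multilingual BERT on a Swedish sentence with a fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
for prediction in fill_mask("Stockholm är huvudstaden i [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```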

Dependency parsing models

Part of speech taggers

Machine Translation models to/from Swedish

Tools

  • Granska -- a grammar checking tool for Swedish
  • Stava -- a spell checking tool for Swedish

Other resources


License: Creative Commons Zero v1.0 Universal