Awesome Danish

A curated list of awesome resources for Danish language technology

Data

Corpora

Danish Gigaword - Collection of Danish corpora (as of May 2020 the corpus is not openly available).
OSCAR - Danish corpus derived from the Common Crawl corpus. Described in Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures (Scholia)
NST
- NST-ngrams - An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM.
- NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
- NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
- NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.
CLARIN-DK-UCPH
- The Danish Parliament Corpus 2009 - 2017, v1. The license is Creative Commons - Attribution 4.0 International
- Grundtvig's Works Corpus. Not for commercial use as the license is Creative Commons - Attribution-NonCommercial 4.0 International.
- DK-CLARIN Reference Corpus of General Danish. Only for academic use.
SemDaX - POS-tagged (only adjectives, nouns and verbs), super sense tagged and BIO-tagged sentences. For educational, teaching or research purposes only.
NOMCO - "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. (Scholia)
Danish Propbank - commercial resource with 87,000 tokens annotated with morphosyntactic, VerbNet classes and semantic roles.
Danish Dependency Treebank v. 1.0 - Matthias Trautner Kromann et al.'s dependency annotation of some texts from PAROLE.
Mr. Bean corpus - Small Danish-Italian corpus with written and spoken retellings (of Mr. Bean episodes) and argumentative text (about smoking). Possibly described in Tekststrukturering på italiensk og dansk
Køge Corpus - Danish-Turkish transcribed corpus by Jens Normann Jørgensen.
Danske taler - Collection of Danish speeches. API available at https://dansketaler.dk/wp-json/wp/v2/tale
DKhate - Corpus of 3600 hate speech items from Twitter and Reddit as well as news comments (to appear in 2020)
DaNewsroom - Danish summarization dataset. Probably to appear in 2020. Described in DaNewsroom: A Large-scale Danish Summarisation Dataset (Scholia)
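
The API URL given in the Danske taler entry above looks like a WordPress REST route. The following is a minimal sketch of how it might be queried, assuming the endpoint honours the standard WordPress `page` and `per_page` parameters and returns items shaped like `{"title": {"rendered": ...}}`. Both points are assumptions and have not been checked against the live service; the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the Danske taler entry in the list above.
BASE_URL = "https://dansketaler.dk/wp-json/wp/v2/tale"


def build_url(page=1, per_page=10):
    """Build a paged request URL.

    Assumption: the endpoint accepts the standard WordPress REST
    parameters `page` and `per_page`.
    """
    query = urllib.parse.urlencode({"page": page, "per_page": per_page})
    return f"{BASE_URL}?{query}"


def extract_titles(payload):
    """Collect titles from a decoded JSON response.

    Assumption: each item has the usual WordPress shape
    {"title": {"rendered": "..."}}; items without a title are skipped.
    """
    titles = []
    for item in payload:
        title = item.get("title", {}).get("rendered")
        if title:
            titles.append(title)
    return titles


def fetch_titles(page=1, per_page=10, timeout=30):
    """Fetch one page of speeches and return their titles (needs network)."""
    with urllib.request.urlopen(build_url(page, per_page), timeout=timeout) as resp:
        return extract_titles(json.load(resp))
```

Calling `fetch_titles()` requires network access; `build_url` and `extract_titles` can be tried offline on a hand-written payload.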

Parallel corpora

Europarl, parallel sentences between Danish and English from the European Parliament.
WikiMatrix, parallel sentences from Wikipedias. 1620 language pairs, including Danish

Spoken language corpora

DanPASS - Described in DanPASS - A Danish Phonetically Annotated Spontaneous Speech corpus (Scholia)
DK-Parole
LANCHART
Common Voice - Crowdsourced multilingual voice dataset. As of 18 December 2019 there is no Danish data. Described in Common Voice: A Massively-Multilingual Speech Corpus (Scholia)

Dictionaries and ontologies

NST-lexical-database - A pronunciation dictionary compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service.
DanNet - Danish Wordnet (v 2.2) in owl format, with a three-clause BSD-like license.
Retskrivningsordbogen. The official Danish spelling dictionary, digitally available under its own special license.
- Opslagsord og ordklasser in CSV format.
- Lexemes, word classes and inflections. Excerpt in the CSF format available. Full list presumably available upon request.
- Lexemes, word classes, inflections, grammatical information, hyphenation and usage examples in XML. Full list presumably available upon request.
The Comprehensive Danish Dictionary/Den Store Danske Ordliste (DSDO), word list created by Skåne Sjælland Linux User Group and distributed under a GPL license
- Primary distribution site at http://da.speling.org/ seems no longer available
- In Debian-based distributions the word list may be installed with sudo aptitude install aspell-da and extracted with aspell -d da dump master.
The Danish FrameNet Lexicon, 40,267 lines resource containing 5,300 verbs and 6,490 verbal nouns
Wikidata lexemes - structured database with metadata about lexemes, their forms and their senses. Over 240,000 lexemes, including over 4,000 Danish lexemes, in February 2020.
- Overview over Danish lexemes in Ordia - webapp with overview of content of Wikidata lexemes based on SPARQL queries.
- Wikidata lexemes latest lexemes dump in ttl - official dump of lexeme-only part of Wikidata.
AFINN - Danish lexicons annotated for sentiment.
concreteness-estimates-da - Bill D. Thompson's concreteness estimates for Danish words, as detailed in Automatic Estimation of Lexical Concreteness in 77 Languages (Scholia).
Danish Swadesh List - List of Danish words of basic concepts from The Rosetta Project.
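
The Wikidata lexemes entry above mentions SPARQL-based overviews. As a rough sketch of what such a query might look like, the code below builds a request for Danish lemmas against the public Wikidata Query Service. The function names, the `User-Agent` string and the `LIMIT` default are illustrative choices, not part of Ordia or of any tool listed here.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"
DANISH = "Q9035"  # Wikidata item for the Danish language


def build_query(language_qid=DANISH, limit=10):
    """Return a SPARQL query listing lexemes and their lemmas for one language."""
    return (
        "SELECT ?lexeme ?lemma WHERE { "
        f"?lexeme dct:language wd:{language_qid} ; "
        "wikibase:lemma ?lemma . "
        f"}} LIMIT {int(limit)}"
    )


def parse_lemmas(result):
    """Extract lemma strings from a SPARQL JSON result document."""
    return [row["lemma"]["value"] for row in result["results"]["bindings"]]


def fetch_lemmas(limit=10, timeout=60):
    """Run the query against the live endpoint (needs network)."""
    params = urllib.parse.urlencode(
        {"query": build_query(limit=limit), "format": "json"}
    )
    request = urllib.request.Request(
        f"{ENDPOINT}?{params}",
        # Hypothetical identifier; use one that describes your own client.
        headers={"User-Agent": "awesome-danish-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_lemmas(json.load(resp))
```

The query relies on the prefixes `dct`, `wd` and `wikibase` that the query service predefines, so no `PREFIX` lines are included.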

Word sets

Danish-Similarity-Dataset - Similarity scores for 99 Danish word pairs by Nina Schneidermann and Bolette Sandford Pedersen.
Wordsim353-da - Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set.
Four words - 100 odd-one-out sets of 4 words or phrases.
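
Odd-one-out sets such as Four words are commonly used to test word embeddings: the model should pick the item least similar to the rest. The sketch below shows one simple way to do that, choosing the word with the lowest mean cosine similarity to the others. The vectors are two-dimensional toy values invented for illustration and are not taken from any dataset or embedding in this list.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def odd_one_out(words, vectors):
    """Return the word with the lowest mean similarity to the other words."""

    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

    return min(words, key=mean_similarity)


# Toy vectors, invented for illustration only.
toy_vectors = {
    "æble": [1.0, 0.1],
    "pære": [0.9, 0.2],
    "banan": [0.8, 0.3],
    "bil": [0.1, 1.0],
}

print(odd_one_out(["æble", "pære", "banan", "bil"], toy_vectors))  # prints: bil
```

With real embeddings, `toy_vectors` would be replaced by a lookup into a loaded model, and accuracy would be the share of sets where the returned word matches the annotated odd one.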

Embeddings

cc.da.300 (bin file GB large) - fastText-trained embedding on Danish part of Common Crawl and Danish Wikipedia. Read more about the method in Learning Word Vectors for 157 Languages (Scholia).
wiki.da (bin+text file) - fastText-trained embedding on Danish Wikipedia. Read more about the method in Enriching Word Vectors with Subword Information (Scholia).
Byte-Pair Encoding embedding - Gensim-based subword embedding. A large number of Danish embeddings are available. They differ in the size of the vocabulary (from 1000 to 200000) and subspace dimensions (from 25 to 300).
Danish NLPL word embedding - 100-dimensional word2vec skipgram model trained by Andrey Kutuzov based on the Danish CoNLL17 corpus.
Danish DSL and Reddit word2vec word embeddings - 300-dimensional CBOW word2vec word embedding by Emil Middelboe and Anders Lillie, trained on the Danish DSL corpus and Reddit.
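
Several of the embeddings above are distributed as text files in addition to binary ones. The sketch below parses the fastText text layout (a header line with word count and dimension, then one word and its values per line) without third-party packages. The file name `wiki.da.vec` in the comment is an assumption about how the downloaded file is named, and `load_vec` is an illustrative helper, not part of fastText.

```python
import io


def load_vec(lines, limit=None):
    """Parse fastText text-format vectors from an iterable of lines.

    The first line holds "<word count> <dimension>"; every following line
    holds a word and its vector values separated by single spaces.
    """
    lines = iter(lines)
    _count, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so that the last `dim` fields are the values.
        parts = line.rstrip().rsplit(" ", dim)
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed rows
        vectors[word] = [float(v) for v in values]
    return vectors


# Tiny in-memory sample in the same layout; the values are made up.
sample = io.StringIO("2 3\nhund 0.1 0.2 0.3\nkat 0.1 0.2 0.4\n")
vectors = load_vec(sample)
print(sorted(vectors))

# With a downloaded file (name assumed):
# with open("wiki.da.vec", encoding="utf-8") as handle:
#     vectors = load_vec(handle, limit=50000)
```

The `limit` argument keeps memory use manageable, since the full files are large and the rows are usually ordered by word frequency.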

Neural models

Danish BERT - Weights for a BERT model trained on a large Danish corpus.

Tools

Lemmatization

Lemmy - Lemmatizer for Danish in Python
cstlemma - lemmatiser

Named entity recognition

daner - Named entity extraction.
flair+danlp ner-tagger - Flair NER tagger trained by the Alexandra Institute.
Polyglot named entity extraction

Sentiment analysis

afinn - Python package with AFINN Danish lexicon annotated for sentiment, also installable with pip install afinn.
Sentida - R package with Danish sentiment lexicon and handling of, e.g., negation. Detailed in SENTIDA: A New Tool for Sentiment Analysis in Danish (Scholia).
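
Both tools above are lexicon-based: a text is scored from the valences of the words it contains. The sketch below illustrates the general idea with a very simple negation rule. It is not the algorithm used by afinn or Sentida, and the lexicon values and negation list are invented for illustration rather than copied from either resource.

```python
import re

# Toy valence lexicon, invented for illustration; NOT taken from AFINN or Sentida.
TOY_LEXICON = {"god": 3, "dejlig": 3, "dårlig": -3, "forfærdelig": -4}
# Assumed negation words for this sketch.
NEGATIONS = {"ikke", "aldrig"}


def score(text, lexicon=TOY_LEXICON):
    """Sum word valences, flipping the sign of a word directly after a negation."""
    tokens = re.findall(r"\w+", text.lower())
    total = 0
    for i, token in enumerate(tokens):
        valence = lexicon.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            valence = -valence
        total += valence
    return total


print(score("Filmen var god"))       # prints: 3
print(score("Filmen var ikke god"))  # prints: -3
```

The rule only looks one token back, so a sentence such as "ikke en god film" is not treated as negated; the real tools should be preferred for anything beyond a demonstration.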

Automatic Speech Recognition

danspeech - DeepSpeech2-based Danish speech recognition in Python
kaldi-sprakbanken - A recipe for training a state-of-the-art (2017) speech recogniser for Danish based on the 16kHz NST database.

Speech Synthesis (text-to-speech)

espeak - An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe.
ResponsiveVoice - Commercial Web-based (Javascript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited and non-commercial use.
Google Cloud Text-to-Speech - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish.
Amazon Polly - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at TTSMP3.

Fundamental processing

DaNLP - "a repository for Natural Language Processing resources for the Danish Language."
dapipe - Danish UD-pipe: tokenisation, lemmatisation, PoS tagging, morphology, dependencies.
UDPipe - Non-language specific version of dapipe. Newer version of the Danish-DDT model than that which is offered by dapipe is available at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998
DKIE - GATE pipeline including wrapped Danish models for Stanford CoreNLP.
StanfordNLP. Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available.
bornholmsk - Datasets and embeddings for the Bornholmsk dialect.

Competitions

ELEXIS Monolingual Word Sense Alignment Task - Predicting the relationship between two senses in each of several languages, including Danish.
OffensEval 2020 - Danish - Offensive Language Identification in Social Media competition. Described in Offensive Language and Hate Speech Detection for Danish (Scholia)

Resources about resources

Danish resources - Finn Årup Nielsen's PDF with pointers to Danish resources.
Scholia's topic aspect for Danish, works (mostly scientific articles) about "Danish" as listed in Wikidata.
Language Technology Resources for Danish, list from Det Danske Sprog- og Litteraturselskab
European Language Resources Association (ELRA) list for Danish, list of various annotated corpora available for purchase with both commercial and non-commercial licenses.
