A curated list of awesome resources for Danish language technology
- NST
- NST-ngrams - An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM.
- NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
- NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
- NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.
- CLARIN-DK-UCPH
- The Danish Parliament Corpus 2009 - 2017, v1. The license is Creative Commons - Attribution 4.0 International
- Grundtvig's Works Corpus. Not for commercial use as the license is Creative Commons - Attribution-NonCommercial 4.0 International.
- DK-CLARIN Reference Corpus of General Danish. Only for academic use.
- SemDaX. For educational, teaching or research purposes only. POS-tagged (only adjectives, nouns and verbs), supersense-tagged and BIO-tagged sentences.
- NOMCO is "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. (Scholia)
- Danish Propbank, a commercial resource with 87,000 tokens annotated with morphosyntactic information, VerbNet classes and semantic roles.
- DKhate, a corpus of 3600 hate speech items from Twitter and Reddit, as well as news comments (to appear in 2019).
Parallel corpora:
- Europarl, parallel sentences between Danish and English from the European Parliament.
- WikiMatrix, parallel sentences from Wikipedias. 1620 language pairs, including Danish
- NST-lexical-database A pronunciation dictionary compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service.
- DanNet - Danish Wordnet (v 2.2) in OWL format, with a three-clause BSD-like license.
- Retskrivningsordbogen. The official Danish spelling dictionary, digitally available under its own special license.
- Headwords and word classes (Opslagsord og ordklasser) in CSV format.
- Lexemes, word classes and inflections. Excerpt available in CSV format. Full list presumably available upon request.
- Lexemes, word classes, inflections, grammatical information, hyphenation and usage examples in XML. Full list presumably available upon request.
- The Comprehensive Danish Dictionary/Den Store Danske Ordliste (DSDO), word list created by Skåne Sjælland Linux User Group and distributed under a GPL license
- Primary distribution site at http://da.speling.org/ seems no longer available
- In Debian-based distributions the word list may be installed with `sudo aptitude install aspell-da` and extracted with `aspell -d da dump master`.
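The extracted word list can be collected into a Python set for lookups. The following is a minimal sketch, assuming `aspell` and the Danish dictionary are installed and that the dump holds one entry per line, possibly with `/FLAGS` affix markers. The function names `parse_word_dump` and `dump_danish_words` are illustrative and not part of any package.

```python
import subprocess


def parse_word_dump(text):
    """Turn a word-per-line dump into a set of words.

    Assumption: each line holds one entry, optionally followed by
    '/FLAGS' affix markers, which are dropped here.
    """
    words = set()
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        words.add(entry.split("/", 1)[0])
    return words


def dump_danish_words():
    """Run 'aspell -d da dump master' and parse its output.

    Requires aspell and the Danish dictionary to be installed.
    """
    result = subprocess.run(
        ["aspell", "-d", "da", "dump", "master"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_word_dump(result.stdout)


if __name__ == "__main__":
    # Small in-memory sample, so the parser can be tried without aspell.
    sample = "hus\nbil/A\n\nhund\n"
    print(sorted(parse_word_dump(sample)))
```

Keeping the parsing separate from the subprocess call means the parser can be tried on any saved dump file as well.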
- The Danish FrameNet Lexicon, a 40,267-line resource containing 5,300 verbs and 6,490 verbal nouns
- Wikidata lexemes, structured database with metadata about lexemes, their forms and their senses. Over 50,000 lexemes, including 1,800 Danish, in June 2019
- kaldi-sprakbanken - A recipe for training a state-of-the-art (2017) speech recogniser for Danish based on the 16kHz NST database.
- espeak - An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe.
- ResponsiveVoice - Web-based (JavaScript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited and non-commercial use.
- AFINN - Danish lexicons annotated for sentiment.
- afinn - Python package with AFINN Danish lexicon annotated for sentiment, also installable with `pip install afinn`.
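A sentiment lexicon of this kind is typically applied by summing the valence of the words found in a text. The sketch below illustrates that idea in plain Python. It assumes a `word<TAB>score` line layout for the lexicon file, and the three entries in `toy_lexicon` are hypothetical examples, not values from the real AFINN word list. The helper names `load_lexicon` and `score_text` are also illustrative.

```python
import re


def load_lexicon(lines):
    """Parse 'word<TAB>score' lines into a dict (assumed file layout)."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        word, score = line.rsplit("\t", 1)
        lexicon[word] = int(score)
    return lexicon


def score_text(text, lexicon):
    """Sum the valence of every known single-word token in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


# Hypothetical toy entries, not taken from the real AFINN word list.
toy_lexicon = load_lexicon(["god\t3", "dårlig\t-3", "fantastisk\t4"])

# One positive and one negative word cancel out in this sentence.
print(score_text("Filmen var god, men slutningen var dårlig.", toy_lexicon))
```

The afinn package itself wraps this approach: its README shows creating a scorer with `Afinn(language='da')` and calling `.score(text)` on it.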
- concreteness-estimates-da - Bill D. Thompson's concreteness estimates for Danish words, as detailed in Automatic Estimation of Lexical Concreteness in 77 Languages (Scholia).
- cc.da.300 (bin file GB large) - fastText-trained embedding on Danish part of Common Crawl and Danish Wikipedia. Read more about the method in Learning Word Vectors for 157 Languages (Scholia).
- wiki.da (bin+text file) - fastText-trained embedding on Danish Wikipedia. Read more about the method in Enriching Word Vectors with Subword Information (Scholia).
- Byte-Pair Encoding embedding - Gensim-based subword embedding. A large number of Danish embeddings are available. They differ in the size of the vocabulary (from 1000 to 200000) and subspace dimensions (from 25 to 300).
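Embeddings distributed as text files can be read without extra dependencies. The sketch below assumes the fastText text layout (a header line with word count and dimension, then one word per line followed by its components) and compares two words by cosine similarity. The vectors in the example are made up for illustration, and `load_vec` and `cosine` are hypothetical helper names.

```python
import io
import math


def load_vec(handle):
    """Read word vectors from a fastText-style text file.

    Assumed layout: a header line 'count dimension', then one line per
    word with the word followed by its space-separated components.
    """
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up three-dimensional vectors, only to exercise the functions.
toy = io.StringIO(
    "3 3\n"
    "konge 1.0 0.0 0.0\n"
    "dronning 1.0 0.0 0.0\n"
    "bil 0.0 1.0 0.0\n"
)
vectors = load_vec(toy)
print(cosine(vectors["konge"], vectors["dronning"]))
print(cosine(vectors["konge"], vectors["bil"]))
```

For a real file, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"`. Loading a full embedding this way keeps every vector in memory, so a dedicated library may be a better fit for the larger models.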
- cstlemma - lemmatiser
- Lemmy - Lemmatizer for Danish in Python
- daner - Named entity extraction.
- DaNLP - "a repository for Natural Language Processing resources for the Danish Language."
- dapipe - Danish UD-pipe: tokenisation, lemmatisation, PoS tagging, morphology, dependencies.
- DKIE - GATE pipeline including wrapped Danish models for Stanford CoreNLP.
- StanfordNLP. Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available.
- bornholmsk - Datasets and embeddings for the Bornholmsk dialect.
- Danish resources - Finn Årup Nielsen's PDF with pointers to Danish resources.
- Scholia's topic aspect for Danish, works (mostly scientific articles) about "Danish" as listed in Wikidata.
- Language Technology Resources for Danish, list from Det Danske Sprog- og Litteraturselskab
- European Language Resources Association (ELRA) list for Danish, list of various annotated corpora available for purchase with both commercial and non-commercial licenses.