swedebugia / awesome-danish

A curated list of awesome resources for Danish language technology

Awesome Danish

Data

Corpora

  • Danish Gigaword - Collection of Danish corpora (as of May 2020 the corpus is not openly available).
  • OSCAR - Danish corpus derived from the Common Crawl corpus. Described in Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures (Scholia)
  • NST
    • NST-ngrams - An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled into an n-gram language model with SRILM.
    • NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
    • NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
    • NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.
  • CLARIN-DK-UCPH
  • SemDaX - POS-tagged (only adjectives, nouns and verbs), super sense tagged and BIO-tagged sentences. For educational, teaching or research purposes only.
  • NOMCO - "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. (Scholia)
  • Danish Propbank - Commercial resource with 87,000 tokens annotated with morphosyntactic information, VerbNet classes and semantic roles.
  • Danish Dependency Treebank v. 1.0 - Matthias Trautner Kromann et al.'s dependency annotation of some texts from PAROLE.
  • Mr. Bean corpus - Small Danish-Italian corpus with written and spoken retellings (of Mr. Bean episodes) and argumentative text (about smoking). Possibly described in Tekststrukturering på italiensk og dansk
  • Køge Corpus - Danish-Turkish transcribed corpus by Jens Normann Jørgensen.
  • Danske taler - Collection of Danish speeches. API available at https://dansketaler.dk/wp-json/wp/v2/tale
  • DKhate - Hate speech corpus of 3600 comments from Twitter and Reddit as well as news comments (to appear in 2020).
  • DaNewsroom - Danish summarization dataset. Probably to appear in 2020. Described in DaNewsroom: A Large-scale Danish Summarisation Dataset (Scholia)
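
The Danske taler entry above names a JSON endpoint. Its path (`/wp-json/wp/v2/`) looks like a standard WordPress REST route, so the following is a minimal sketch of how it might be queried, assuming the usual WordPress conventions: `page` and `per_page` query parameters and a `title.rendered` field on each item. Neither assumption is confirmed by this list, and the helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the Danske taler entry in the list above.
API_URL = "https://dansketaler.dk/wp-json/wp/v2/tale"


def build_url(page=1, per_page=10):
    """Return the endpoint URL with WordPress-style paging parameters (assumed)."""
    return API_URL + "?" + urlencode({"page": page, "per_page": per_page})


def extract_titles(payload):
    """Collect the rendered title of each item in a decoded JSON response.

    Assumes the usual WordPress layout {"title": {"rendered": ...}};
    items that lack it are skipped rather than raising an error.
    """
    titles = []
    for item in payload:
        title = item.get("title", {})
        if isinstance(title, dict) and "rendered" in title:
            titles.append(title["rendered"])
    return titles


def fetch_titles(page=1, per_page=10):
    """Fetch one page of speeches and return their titles (needs network access)."""
    with urlopen(build_url(page, per_page)) as response:
        return extract_titles(json.load(response))
```

Calling `fetch_titles()` would return the titles on the first page if the assumptions hold; `extract_titles` can be tried offline on any saved response.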

Parallel corpora

  • Europarl - Parallel sentences between Danish and English from the European Parliament.
  • WikiMatrix - Parallel sentences from Wikipedias. 1620 language pairs, including Danish.

Spoken language corpora

  • DanPASS - Described in DanPASS - A Danish Phonetically Annotated Spontaneous Speech corpus (Scholia)
  • DK-Parole
  • LANCHART
  • Common Voice - Crowdsourced multilingual voice dataset. As of 18 December 2019 there is no Danish data. Described in Common Voice: A Massively-Multilingual Speech Corpus (Scholia)

Dictionaries and ontologies

Word sets

  • Danish-Similarity-Dataset - Similarity scores for 99 Danish word pairs by Nina Schneidermann and Bolette Sandford Pedersen.
  • Wordsim353-da - Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set.
  • Four words - 100 odd-one-out sets of 4 words or phrases.
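
Word sets like these are typically used to probe word embeddings. As a rough sketch of how an odd-one-out set could be scored, the snippet below picks the word with the lowest mean cosine similarity to the other words in its set. The vectors are made-up toy values rather than real Danish embeddings, and the file format of the Four words dataset is not assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def odd_one_out(words, vectors):
    """Return the word with the lowest mean cosine similarity to the others."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

    return min(words, key=mean_similarity)


# Hypothetical 2-dimensional vectors, invented for illustration only.
TOY_VECTORS = {
    "hund": (0.9, 0.1),
    "kat": (0.8, 0.2),
    "hest": (0.85, 0.15),
    "bil": (0.1, 0.9),
}

print(odd_one_out(["hund", "kat", "hest", "bil"], TOY_VECTORS))
```

With real embeddings, accuracy over the 100 sets would be the share of sets where the predicted word matches the annotated one.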

Embeddings

Neural models

  • Danish BERT - Weights for a BERT model trained on a large Danish corpus.

Tools

Lemmatization

Named entity recognition

Sentiment analysis

Automatic Speech Recognition

  • danspeech - DeepSpeech2-based Danish speech recognition in Python
  • kaldi-sprakbanken - A recipe for training a speech recogniser for Danish (state of the art as of 2017) based on the 16kHz NST database.

Speech Synthesis (text-to-speech)

  • espeak - An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe.
  • ResponsiveVoice - Commercial Web-based (JavaScript-based) text-to-speech synthesis for a number of languages, including Danish. The service is currently free for limited, non-commercial use.
  • Google Cloud Text-to-Speech - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish.
  • Amazon Polly - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at TTSMP3.

Fundamental processing

  • DaNLP - "a repository for Natural Language Processing resources for the Danish Language."
  • dapipe - Danish UD-pipe: tokenisation, lemmatisation, PoS tagging, morphology, dependencies.
  • UDPipe - Language-independent tool on which dapipe is based. A newer version of the Danish-DDT model than the one offered by dapipe is available at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998
  • DKIE - GATE pipeline including wrapped Danish models for Stanford CoreNLP.
  • StanfordNLP - Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available.
  • bornholmsk - Datasets and embeddings for the Bornholmsk dialect.
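
UDPipe writes its analyses in the tab-separated CoNLL-U format, with ten columns per token. The sketch below shows one way to read such output into Python dictionaries. The sample sentence is hand-written for illustration and is not actual output from any of the tools listed above.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # comment lines such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        current.append(dict(zip(FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written example, not produced by a real model.
SAMPLE = (
    "# text = Jeg læser bøger.\n"
    "1\tJeg\tjeg\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tlæser\tlæse\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tbøger\tbog\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

Keeping every column as a string avoids special-casing multiword token IDs such as `1-2`, which are not integers.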

Competitions

Resources about resources

License:Other