Helsinki-NLP / LSDC-morph

LSDC-morph

This dataset contains PoS-tagged and morphologically annotated data in Low Saxon dialects spanning several centuries, mostly from the modern period starting in 1650. It will be updated continuously as the annotation progresses. The annotation follows the CoNLL-U format.
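As a minimal sketch of how CoNLL-U data can be read, the following Python function splits a CoNLL-U string into sentences and maps each token line to the ten standard CoNLL-U columns. The function name `parse_conllu` is an assumption made for illustration, and the sample sentence is an invented English placeholder, not taken from this dataset.

```python
# The ten tab-separated fields defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        else:
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:  # last sentence when the input lacks a trailing blank line
        sentences.append(tokens)
    return sentences


# Invented placeholder sentence (NOT from the dataset), for illustration only.
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t_\t_\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\tNumber=Plur\t_\t_\t_\t_\n"
    "\n"
)

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["upos"], token["feats"])
```

For anything beyond quick inspection, a dedicated CoNLL-U library is likely a better choice, since this sketch does not treat multiword tokens or empty nodes specially.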

Content

So far, the dataset contains sentences from four German Low Saxon dialects: German North Saxon (Holsatian / Holstein dialect, HOL), Marchian / Brandenburgish (MAR), Mecklenburgish - West Pomeranian (MKB) and Eastphalian (OFL). Each dialect is represented by 50 sentences, except Eastphalian, which has 66 in total because sentences 5–20 are lines from a poem. More dialects will be added in future releases.
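The sentence counts stated above can be checked with a short sketch like the following. It assumes the data for each dialect is available as one CoNLL-U string; the helper names `count_sentences` and `check_counts` are hypothetical, and the actual file layout of the repository is not described here.

```python
# Sentence counts per dialect as stated in this README.
EXPECTED = {"HOL": 50, "MAR": 50, "MKB": 50, "OFL": 66}


def count_sentences(conllu_text):
    """Count sentences as blocks of token lines separated by blank lines."""
    count, in_sentence = 0, False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence block.
            in_sentence = False
        elif not line.startswith("#") and not in_sentence:
            # First token line of a new sentence block.
            count += 1
            in_sentence = True
    return count


def check_counts(texts):
    """Return {dialect: (found, expected)} for dialects whose counts differ."""
    mismatches = {}
    for code, text in texts.items():
        found = count_sentences(text)
        if found != EXPECTED[code]:
            mismatches[code] = (found, EXPECTED[code])
    return mismatches
```

Called with a dictionary that maps each dialect code to the contents of its file, `check_counts` returns an empty dictionary when every count matches the figures above.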

Please cite the following paper if you use data from this distribution:

@inproceedings{siewert-etal-2021-towards,
    title = "Towards a balanced annotated Low {S}axon dataset for diachronic investigation of dialectal variation",
    author = {Siewert, Janine  and
      Scherrer, Yves  and
      Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)",
    month = "6--9 " # sep,
    year = "2021",
    address = {D{\"u}sseldorf, Germany},
    publisher = "KONVENS 2021 Organizers",
    url = "https://aclanthology.org/2021.konvens-1.25",
    pages = "242--246",
}

License

Our data are released under this licensing scheme:

License: CC BY-NC 4.0

  • We do not own any of the texts from which these data have been extracted.
  • We license the actual packaging of these data under the Creative Commons Attribution-NonCommercial 4.0 International Public License.

Notice and take down policy

Notice: Should you consider that our data contain material that you own and that should not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • And contact Janine Siewert at the following email address: janine DOT siewert AT helsinki.fi.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
