This repo contains an ongoing project to create a gold-standard PoS-tagged corpus for Faroese.
Faroese texts are tagged with the Faroese implementation of ABLTagger and hand-corrected using MoLL, with an intermediate step that maps back and forth between the Icelandic and Faroese PoS tagging schemes.
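
The mapping step can be pictured as a lookup table applied in both directions. The following is only a minimal sketch: the tag strings, the `map_tags` helper, and the `UNMAPPED` fallback are hypothetical placeholders, not the actual Sosialurin or MIM-GOLD tags and not the scripts used in this project.

```python
# Hypothetical placeholder tags; the real Faroese and Icelandic tag strings differ.
FO_TO_IS = {"FO_NOUN": "IS_NOUN", "FO_VERB": "IS_VERB"}

# Reverse table, assuming (for this sketch only) that the mapping is one-to-one.
IS_TO_FO = {is_tag: fo_tag for fo_tag, is_tag in FO_TO_IS.items()}


def map_tags(tagged_tokens, table, unknown="UNMAPPED"):
    """Return (word, tag) pairs with each tag replaced via `table`.

    Tags missing from the table are replaced by `unknown` so they can be
    found and handled during manual correction.
    """
    return [(word, table.get(tag, unknown)) for word, tag in tagged_tokens]


sentence = [("hestur", "FO_NOUN"), ("rennur", "FO_VERB")]
for_correction = map_tags(sentence, FO_TO_IS)      # Faroese scheme -> Icelandic scheme
round_trip = map_tags(for_correction, IS_TO_FO)    # and back again
```

If the two schemes are not one-to-one in practice, the reverse table would need explicit entries rather than a simple inversion.
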
Currently, the source corpora for the project are:
- Sosialurin Corpus (Hansen et al. 2004, Hafsteinsson 2020)
- Faroese Parsed Historical Corpus (FarPaHC), extracted from the UD release
- Faroese Text Corpus (FTS)
- UD Faroese OFT
TBD: Further corpora and scope of sampled text from each corpus.
- fo_tagged: contains files tagged using the Faroese (Sosialurin) tagset.
- to_correct: contains files ready for manual correction (converted to the MIM-GOLD tagset).
- finished: contains fully corrected files.
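
As a rough way to see how far the correction work has progressed, the files in each of the three directories named above can be counted. This is a sketch only: `stage_counts` is a hypothetical helper, and it assumes the directories sit directly under the repository root and that each file corresponds to one text.

```python
from pathlib import Path

# Directory names as given in this README.
STAGES = ("fo_tagged", "to_correct", "finished")


def stage_counts(root="."):
    """Count regular files in each stage directory under `root`.

    A missing directory is reported as 0 rather than raising an error.
    """
    root = Path(root)
    counts = {}
    for stage in STAGES:
        folder = root / stage
        if folder.is_dir():
            counts[stage] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            counts[stage] = 0
    return counts


if __name__ == "__main__":
    for stage, n in stage_counts().items():
        print(f"{stage}: {n} file(s)")
```
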