hinrikur / FAR-GOLD

Faroese Gold Standard Corpus

This repo hosts the ongoing project to create a gold-standard, PoS-tagged corpus for Faroese.

Faroese texts are tagged with the Faroese implementation of ABLTagger and hand-corrected using MoLL, with an intermediate step that maps the tags back and forth between the Icelandic and Faroese PoS tagging schemes.
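The mapping step can be pictured as a lookup table applied in one direction before correction and in the reverse direction afterwards. The sketch below is only an illustration under stated assumptions: the tag strings (`FO-NOUN`, `IS-NOUN`, and so on), the `convert_tags` and `is_lossless` helpers, and the `UNMAPPED` marker are all hypothetical and are not taken from the scripts in this repo or from the real Sosialurin or MIM-GOLD tagsets.

```python
# Hypothetical placeholder tags. The real Sosialurin and MIM-GOLD tag
# strings are different; these names exist only for illustration.
FO_TO_IS = {
    "FO-NOUN": "IS-NOUN",
    "FO-VERB": "IS-VERB",
    "FO-ADJ": "IS-ADJ",
}

# Reverse table, used to map corrected files back to the Faroese scheme.
IS_TO_FO = {is_tag: fo_tag for fo_tag, is_tag in FO_TO_IS.items()}


def convert_tags(tagged, mapping, unknown="UNMAPPED"):
    """Return (token, tag) pairs with each tag replaced via `mapping`.

    Tags missing from the mapping are replaced by `unknown` so they can
    be spotted during manual correction instead of being silently kept.
    """
    return [(token, mapping.get(tag, unknown)) for token, tag in tagged]


def is_lossless(mapping):
    """True if no two source tags collapse onto the same target tag."""
    return len(set(mapping.values())) == len(mapping)


sentence = [("hundur", "FO-NOUN"), ("spælir", "FO-VERB")]
as_icelandic = convert_tags(sentence, FO_TO_IS)
round_trip = convert_tags(as_icelandic, IS_TO_FO)
```

A check like `is_lossless` matters for a round trip: if two Faroese tags map to one Icelandic tag, the reverse table cannot recover the original distinction, and that information has to be restored some other way.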

Source corpora

Currently the source corpora for the project are:

TBD: Further corpora and scope of sampled text from each corpus.

Directories:

correction - Files undergoing manual correction.

  • fo_tagged contains files tagged using the Faroese (Sosialurin) tagset.
  • to_correct contains files ready for manual correction (converted to the MIM-GOLD tagset).
  • finished contains fully corrected files.
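The three subdirectories above read as successive stages, so a small helper can report how far each file has progressed. This is a minimal sketch, assuming that a file keeps the same base name as it moves from `fo_tagged` to `to_correct` to `finished`; that naming convention and the `stage_of_each_file` function are assumptions for illustration, not something this repo documents.

```python
from pathlib import Path

# Stage directories in the order a file is assumed to pass through them.
STAGES = ("fo_tagged", "to_correct", "finished")


def stage_of_each_file(correction_dir):
    """Map each file's base name to the latest stage in which it appears.

    Assumes (hypothetically) that a file keeps the same base name in
    every stage directory. Missing stage directories are skipped.
    """
    root = Path(correction_dir)
    latest = {}
    for stage in STAGES:
        stage_dir = root / stage
        if not stage_dir.is_dir():
            continue
        for path in sorted(stage_dir.iterdir()):
            if path.is_file():
                # Later stages overwrite earlier ones for the same name.
                latest[path.stem] = stage
    return latest
```

Calling `stage_of_each_file("correction")` from the repo root would then return a dictionary such as base name to stage name, which makes it easy to list the files that still await manual correction.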

gold - (WIP) "Release" version of the gold corpus.

source_corpora - Unedited source corpora for the project.

tagsets - Lists of various tagsets compiled for reference.

scripts - Various scripts used to pre-process the source corpora for correction.

Languages

Python 99.5%, Shell 0.5%