KurdishBLARK / KTC-Segmented

A segmented version of KTC

KTC-Segmented

This repository contains the sentence-segmented version of KTC and follows KTC's structure. Each file is the line-segmented form of its counterpart in the raw corpus. The segmentation process and related discussion are presented in the paper "Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts", which appeared at the AfricaNLP Workshop at ICLR 2020. See the presentation of the related article here. See the related poster here.
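Since each file is line-segmented, reading the corpus amounts to treating every non-empty line as one sentence. The following is a minimal sketch of that idea, not part of the repository itself. It assumes the files are UTF-8 text with a `.txt` extension, and the names `iter_sentences`, `count_sentences`, and `corpus_root` are hypothetical helpers introduced for illustration.

```python
from pathlib import Path
from typing import Dict, Iterator


def iter_sentences(path: Path) -> Iterator[str]:
    """Yield one sentence per non-empty line of a line-segmented file.

    Assumption: the file is UTF-8 encoded text.
    """
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            sentence = line.strip()
            if sentence:
                yield sentence


def count_sentences(corpus_root: Path, pattern: str = "*.txt") -> Dict[str, int]:
    """Map each file (path relative to corpus_root) to its sentence count.

    Assumption: corpus files match `pattern`; adjust it if the
    extension in your checkout differs.
    """
    counts: Dict[str, int] = {}
    for file_path in sorted(corpus_root.rglob(pattern)):
        key = file_path.relative_to(corpus_root).as_posix()
        counts[key] = sum(1 for _ in iter_sentences(file_path))
    return counts
```

Calling `count_sentences(Path("KTC-Segmented"))` on a local checkout (the directory name here is an assumption) would give a per-file sentence count. Because the repository follows KTC's structure, the relative paths can also be used to look up the raw counterpart of each file.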

If you use this data, refer to it, or refer to its related paper, please cite it as follows:

@inproceedings{abdulrahman2020using,
    title = "Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts",
    author = "Abdulrahman, Roshna Omer and Hassani, Hossein",
    booktitle = "Proceedings of the AfricaNLP Workshop at ICLR 2020",
    month = "4",
    year = "2020",
    address = "Virtual",
    url = "http://export.arxiv.org/pdf/2004.14134",
    eprint = "2004.14134",
    archivePrefix = "arXiv",
    primaryClass = "cs.CL",
}