connormayer / uyghur_regex_searcher

Three searchable online Uyghur text corpora

This repository contains corpora created from three Uyghur websites:

  • Uyghur Akademiyisi (Uyghur Academy): A legal research organization that publishes articles on Uyghur culture and politics. Retrieved July 2022.
  • Uyghur Awazi (Uyghur Voice): A Uyghur-language newspaper published in Almaty, Kazakhstan. Retrieved January 2020.
  • Radio Free Asia (RFA): The Uyghur-language website of a US-sponsored non-profit news organization. Retrieved July 2022.

The corpora folder contains a directory for each corpus. In each directory, the XX_documents.zip file contains the raw text of every article, and the metadata.csv file lists the articles with their corresponding metadata. The documents are stored as zip files to save space. The scripts that operate on this data zip and unzip them automatically.
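For readers who want to inspect a corpus outside the provided scripts, the layout described above can be read roughly as follows. This is a minimal sketch, not the repository's own code: the function names (read_documents, read_metadata, load_corpus) are hypothetical, and it assumes the archive members are UTF-8 text files and that metadata.csv has a header row.

```python
import csv
import zipfile
from pathlib import Path


def read_documents(zip_path):
    """Return {archive member name: text} without extracting to disk.

    Assumes every member is a UTF-8 encoded text file.
    """
    docs = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            docs[name] = archive.read(name).decode("utf-8")
    return docs


def read_metadata(csv_path):
    """Return the metadata rows as dicts keyed by the CSV header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_corpus(corpus_dir):
    """Load the documents and metadata from one corpus directory."""
    corpus_dir = Path(corpus_dir)
    # Assumption: each directory holds exactly one *_documents.zip file.
    zip_path = next(corpus_dir.glob("*_documents.zip"))
    return read_documents(zip_path), read_metadata(corpus_dir / "metadata.csv")
```

Reading members directly from the archive leaves the zip files untouched, so nothing needs to be re-zipped afterwards.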

The regex_searcher.py script can be edited to search all the corpora for sentences that match a specified regular expression. It produces a .csv file containing every sentence with a match. Keep in mind that the script removes punctuation before searching, so a pattern that relies on punctuation characters will not match.
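The general idea of searching punctuation-free sentences can be sketched as below. This is an illustration under stated assumptions, not the logic of regex_searcher.py itself: the sentence-splitting rule, the helper names, and the single-column output format are all hypothetical.

```python
import csv
import re
import unicodedata

# Assumption: sentences end at ., !, ?, or the Arabic question mark.
SENTENCE_END = re.compile(r"[.!?؟]+")


def strip_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def find_matching_sentences(text, pattern):
    """Return the punctuation-free sentences in text that match pattern."""
    regex = re.compile(pattern)
    matches = []
    for raw in SENTENCE_END.split(text):
        # Remove punctuation, then collapse runs of whitespace.
        sentence = " ".join(strip_punctuation(raw).split())
        if sentence and regex.search(sentence):
            matches.append(sentence)
    return matches


def write_matches(matches, out_path):
    """Write one matching sentence per row to a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sentence"])
        writer.writerows([m] for m in matches)
```

For example, `find_matching_sentences("Men kitab oqudum. U mektepke bardi, andin qaytti!", r"\bbardi\b")` returns the second sentence with its comma removed.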

If you find this repository useful, please cite the following items:

Mayer, C. (2021). Issues in Uyghur backness harmony: Corpus, experimental, and computational studies (Unpublished doctoral dissertation). University of California, Los Angeles.

Mayer, C., & Major, T. (2023). Three searchable online Uyghur text corpora (Version 0.1.0) [Computer software]. https://doi.org/10.5281/zenodo.8221677

About

License:GNU General Public License v3.0


Languages

Language:Python 100.0%