sandstorm12 / persian_readability_dataset

Persian Text Readability Dataset

A readability dataset in the Persian (Farsi) language.

Information

| Level      | # of texts | Avg. words per text |
|------------|------------|---------------------|
| 0 (easy)   | 2953       | 28.8                |
| 1 (medium) | 572        | 39.8                |
| 2 (hard)   | 322        | 62.1                |
| Total      | 3847       | 33.2                |
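
The "Total" row can be cross-checked from the per-level rows: the overall average is the average of the per-level values weighted by the number of texts. The short sketch below does that arithmetic and also shows how skewed the dataset is toward level 0. The numbers are copied from the table; the variable names are only illustrative, and because the per-level averages are rounded, the result is approximate.

```python
# Per-level (number of texts, average words per text), copied from the table.
levels = {0: (2953, 28.8), 1: (572, 39.8), 2: (322, 62.1)}

total_texts = sum(n for n, _ in levels.values())

# Overall average = per-level averages weighted by the number of texts.
weighted_avg = sum(n * avg for n, avg in levels.values()) / total_texts

# Share of texts labeled 0 (easy), a quick indicator of class imbalance.
easy_share = levels[0][0] / total_texts

print(total_texts)             # 3847
print(round(weighted_avg, 1))  # 33.2
print(round(easy_share, 3))    # 0.768
```

Roughly three quarters of the texts are labeled easy, which may be worth keeping in mind when splitting the data or choosing an evaluation metric.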

How to load?

import pickle

address = "/path/to/dataset.pkl"
# Note: only unpickle files that come from a source you trust.
with open(address, 'rb') as handle:
    # A list of data points, each of the form
    # [text (str), label (int), rater readability level (list of 3 floats)]
    dataset = pickle.load(handle)
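
Once loaded, the list can be summarized per label. The following is a minimal sketch, not part of the repository: `summarize` is a hypothetical helper, and it counts words by splitting on whitespace, which is an assumption and may not match the tokenization behind the statistics reported for this dataset. The `sample` list is made-up placeholder data used only to show the expected input shape.

```python
from collections import Counter


def summarize(dataset):
    """Return {label: (number of texts, average words per text)}.

    Assumes each data point is [text, label, rater_levels] and that
    words are separated by whitespace (an assumption of this sketch).
    """
    text_counts = Counter()
    word_counts = Counter()
    for text, label, _rater_levels in dataset:
        text_counts[label] += 1
        word_counts[label] += len(text.split())
    return {
        label: (text_counts[label], word_counts[label] / text_counts[label])
        for label in sorted(text_counts)
    }


# Made-up placeholder data in the same shape as the real data points.
sample = [
    ["w1 w2", 0, [0.6, 0.3, 0.1]],
    ["w1 w2 w3", 0, [0.5, 0.4, 0.1]],
    ["w1 w2 w3 w4", 1, [0.1, 0.5, 0.4]],
]

for label, (count, avg_words) in summarize(sample).items():
    print(label, count, avg_words)
```

With the real file, `summarize(dataset)` would be called on the list returned by `pickle.load`.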

What is in the dataset?

The dataset (a pickle file) contains a list of data points. Each data point consists of a string (the text), an int (the label), and a list of three floats (the rater readability level). Each text is in Persian and was rated by at least three raters. The label is the readability level (0, 1, or 2) that the raters assigned to the text; only texts with over 80% agreement between raters are included. The rater readability level captures the average bias of the text's raters toward each readability level. For example, a rater readability level of [0.1, 0.5, 0.4] means that the raters of this text assigned level 0 to 10%, level 1 to 50%, and level 2 to 40% of the other texts they rated. Using this metric is critical, as raters differ in education and reading level.
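
As a rough illustration of how the rater readability level might be used, the sketch below checks that a vector looks like a set of proportions and turns it into a single "typical level" for the raters. All three helpers are hypothetical and are not the method from the cited paper. That the three values sum to 1 is an assumption inferred from the example [0.1, 0.5, 0.4], and the offset is just one simple way to see whether a label is higher or lower than what the raters usually assign.

```python
import math


def looks_like_proportions(rater_levels, tol=1e-6):
    """True if there are three values that sum to about 1 (assumed property)."""
    return len(rater_levels) == 3 and math.isclose(
        sum(rater_levels), 1.0, abs_tol=tol
    )


def expected_rater_level(rater_levels):
    """Average level the raters tend to assign: sum of level * share."""
    return sum(level * share for level, share in enumerate(rater_levels))


def label_offset(label, rater_levels):
    """Label minus the raters' typical level.

    A positive value means the text was rated harder than these raters
    usually rate; a negative value means easier.
    """
    return label - expected_rater_level(rater_levels)


rater_levels = [0.1, 0.5, 0.4]  # the example vector from the text above
print(looks_like_proportions(rater_levels))
print(expected_rater_level(rater_levels))  # 0*0.1 + 1*0.5 + 2*0.4, about 1.3
print(label_offset(2, rater_levels))       # about 0.7
```

Another option, equally an assumption, is to pass the three values directly to a model as extra input features alongside the text features.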

Acknowledgement

Our sincere gratitude goes out to the undergraduate computer engineering students at the K.N. Toosi University of Technology who participated in gathering annotations.

Citation

@inproceedings{mohammadi2020machine,
  title={A machine learning approach to Persian text readability assessment using a crowdsourced dataset},
  author={Mohammadi, Hamid and Khasteh, Seyed Hossein},
  booktitle={2020 28th Iranian Conference on Electrical Engineering (ICEE)},
  pages={1--7},
  year={2020},
  organization={IEEE}
}
