Arabic Text Diacritization

This repository contains the dataset, helpers, and systems comparison for our paper on Arabic Text Diacritization:

"Arabic Text Diacritization Using Deep Neural Networks", Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, and Mahmoud Al-Ayyoub, ICCAIS 2019.

Files

train.txt - Contains 50,000 lines of diacritized Arabic text for use as the training set
val.txt - Contains 2,500 lines of diacritized Arabic text for use as the validation set
test.txt - Contains 2,500 lines of diacritized Arabic text for use as the test set
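
As a rough illustration of how a diacritized line from these files might be turned into model inputs and targets, the sketch below separates each base character from the diacritics that follow it. The function name split_line and the diacritic set (Unicode U+064B to U+0652) are assumptions made for this example; they are not taken from the repository's code or constants.

```python
# Assumption: the eight common Arabic diacritics, U+064B (fathatan) to U+0652 (sukun).
# The repository's DIACRITICS_LIST.pickle may define a different set.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_line(line):
    """Split a diacritized line into base characters and per-character diacritics.

    Returns two lists of equal length: the characters with diacritics removed,
    and the diacritic string attached to each character ("" if none).
    """
    chars, labels = [], []
    for ch in line:
        if ch in DIACRITICS:
            # Attach the diacritic to the preceding base character.
            # A diacritic with no preceding character is dropped.
            if chars:
                labels[-1] += ch
        else:
            chars.append(ch)
            labels.append("")
    return chars, labels


if __name__ == "__main__":
    # "kataba" written with a fatha on each letter.
    sample = "\u0643\u064E\u062A\u064E\u0628\u064E"
    chars, labels = split_line(sample)
    print("".join(chars), [len(label) for label in labels])
```

Keeping the two lists aligned by index makes it straightforward to treat diacritization as a per-character classification task, which is one common way to frame the problem.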

constants
- ARABIC_LETTERS_LIST.pickle - Contains the list of Arabic letters
- CLASSES_LIST.pickle - Contains the list of all possible classes
- DIACRITICS_LIST.pickle - Contains the list of all diacritics
count_characters.py - Counts the number of Arabic letters and diacritics in a file
count_fathatan.py - Counts the number of fathatan occurrences before and after Alif in all files from a folder
diacritization_stat.py - Calculates the diacritic error rate (DER) and word error rate (WER) using the gold data and the predicted output
diacritics_rate_extractor.py - Keeps only the lines whose diacritics-to-Arabic-characters rate is p% or more, in all files from a folder
file_lookup.py - Searches for a line in all files from a folder
fix_fathatan.py - Changes after-Alif fathatan to before-Alif fathatan in a file
remove_diacritics.py - Removes diacritics from a file
transliteration.py - Converts a file from Arabic text to Buckwalter transliteration and vice versa
pre_process_tashkeela_corpus.ipynb - Pre-processes the Tashkeela Corpus data
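
The behaviour described for remove_diacritics.py and diacritics_rate_extractor.py can be approximated as follows. This is a minimal sketch and not the repository's implementation: the function names, the diacritic set (U+064B to U+0652), and the letter set (U+0621 to U+063A plus U+0641 to U+064A) are assumptions, and the pickled constants in this repository may differ.

```python
# Assumed character sets; the repository's pickled lists may not match these.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}
ARABIC_LETTERS = {
    chr(code)
    for code in list(range(0x0621, 0x063B)) + list(range(0x0641, 0x064B))
}


def strip_diacritics(text):
    """Return text with every character in DIACRITICS removed."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def diacritics_rate(text):
    """Return the diacritics-to-Arabic-letters rate of text, as a percentage.

    Text without any Arabic letters is given a rate of 0.0.
    """
    n_diacritics = sum(ch in DIACRITICS for ch in text)
    n_letters = sum(ch in ARABIC_LETTERS for ch in text)
    if n_letters == 0:
        return 0.0
    return 100.0 * n_diacritics / n_letters


def keep_lines(lines, p):
    """Keep only the lines whose diacritics rate is at least p percent."""
    return [line for line in lines if diacritics_rate(line) >= p]


if __name__ == "__main__":
    full = "\u0643\u064E\u062A\u064E\u0628\u064E"  # every letter diacritized
    partial = "\u0643\u064E\u062A\u0628"           # one of three letters diacritized
    print(strip_diacritics(full) == strip_diacritics(partial))
    print(len(keep_lines([full, partial], 50)))
```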

ali-soft - Contains some bugs that exist in the Ali-Soft system
farasa - Contains Farasa system output, fixed output, and DER/WER statistics
harakat - Contains Harakat system testing script, output, fixed output, and DER/WER statistics
madamira - Contains MADAMIRA system output, fixed output, and DER/WER statistics
mishkal - Contains Mishkal system output, fixed output, and DER/WER statistics
shakkala - Contains Shakkala system data splitting script, output, fixed output, and DER/WER statistics
tashkeela_model - Contains Tashkeela-Model system output, fixed output, and DER/WER statistics for each n-gram model provided by them
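
For readers unfamiliar with the DER/WER statistics reported for each system, the sketch below shows one simplified way such figures could be computed. It is not the logic of diacritization_stat.py: it assumes the gold and predicted texts have identical base characters, counts every non-space character, and ignores variants such as excluding case endings. The names char_labels and der_wer and the diacritic set (U+064B to U+0652) are assumptions made for illustration.

```python
# Assumed diacritic set; the repository's DIACRITICS_LIST.pickle may differ.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def char_labels(word):
    """Return (base_character, sorted_diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # a leading diacritic with no base character is ignored
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    # Sort the marks so that e.g. shadda+fatha equals fatha+shadda.
    return [(base, "".join(sorted(marks))) for base, marks in pairs]


def der_wer(gold, pred):
    """Return (DER, WER) as percentages for two whitespace-tokenized texts."""
    gold_words, pred_words = gold.split(), pred.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted texts differ in word count")
    char_total = char_errors = word_errors = 0
    for gold_word, pred_word in zip(gold_words, pred_words):
        g, p = char_labels(gold_word), char_labels(pred_word)
        if [base for base, _ in g] != [base for base, _ in p]:
            raise ValueError("gold and predicted base characters differ")
        wrong = sum(gm != pm for (_, gm), (_, pm) in zip(g, p))
        char_total += len(g)
        char_errors += wrong
        word_errors += 1 if wrong else 0
    der = 100.0 * char_errors / char_total if char_total else 0.0
    wer = 100.0 * word_errors / len(gold_words) if gold_words else 0.0
    return der, wer


if __name__ == "__main__":
    gold = "\u0643\u064E\u062A\u064E\u0628\u064E"
    pred = "\u0643\u064E\u062A\u0650\u0628\u064E"  # kasra instead of fatha on the 2nd letter
    print(der_wer(gold + " " + gold, pred + " " + gold))
```

In this simplified form, a word counts as a word error as soon as any one of its characters carries the wrong diacritics, so WER is always at least as sensitive to mistakes as DER.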

The project is available as open source under the terms of the MIT License.

