nipponjo / arabic-speech-to-text

arabic-speech-to-text

This repository contains the code for training the QuartzNet ASR model (NeMo) on the QCRI-AL Jazeera Corpus.

Data preprocessing

Download the QCRI-AL Jazeera Corpus. The script a_preprocess_xml.py extracts the text segments from the XML files. The script b_filter_ds.py removes segments that contain Latin script or numerals. The script c_split_ds.py splits the remaining segments into a training set and a test set.
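
For illustration, the filtering and splitting steps could look roughly like the sketch below. It is not the code from b_filter_ds.py or c_split_ds.py. The function names, the 10% test fraction, the fixed seed, and the choice to treat Arabic-Indic digits as numerals are all assumptions made for this example.

```python
import random
import re

# Assumption: "Latin script or numerals" means ASCII letters, ASCII digits,
# and Arabic-Indic / Extended Arabic-Indic digits. Accented Latin letters
# are not covered by this pattern.
_LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9\u0660-\u0669\u06F0-\u06F9]")


def has_latin_or_digit(text):
    """Return True if the segment contains a Latin letter or a digit."""
    return _LATIN_OR_DIGIT.search(text) is not None


def filter_segments(segments):
    """Keep only the segments without Latin letters or digits."""
    return [s for s in segments if not has_latin_or_digit(s)]


def split_segments(segments, test_fraction=0.1, seed=0):
    """Shuffle with a fixed seed and return (train, test) lists."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    segments = ["مرحبا بكم", "قناة BBC", "عام 2011", "نشرة الأخبار"]
    kept = filter_segments(segments)
    train, test = split_segments(kept, test_fraction=0.5)
    print(len(kept), len(train), len(test))
```

Using a fixed seed keeps the split reproducible between runs, so the same segments always end up in the test set.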

TODO

  • Upload pretrained model
  • ...
