Niger-Volta-LTI / yoruba-adr

Automatic Diacritic Restoration of Yorùbá language Text

Automatic Diacritic Restoration of Yorùbá Text

Motivations

Nigeria’s dying languages!

Applications

  • Generating very large, high quality Yorùbá text corpora

    • [physical books] → OCR → [undiacritized text] → ADR → [clean diacritized text]
      Physical books written in Yorùbá (novels, manuals, school books, dictionaries) are digitized via Optical Character Recognition (OCR), which may drop or corrupt tonal and orthographic diacritics. The undiacritized text is then passed through ADR to restore the correct diacritics.
    • Correcting digital text scraped from online sources such as Twitter, Naija forums, and articles
    • Suggesting corrections during manual text entry (spell/diacritic checker)
  • Preprocessing text for training Yorùbá

    • language models
    • word embeddings
    • text-language identification (so Twitter can stop claiming Yorùbá text is Vietnamese haba!)
    • part-of-speech taggers
    • named-entity recognition
    • text-to-speech (TTS) models (speech synthesis)
    • speech-to-text (STT) models (speech recognition)
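
The [undiacritized text] → ADR → [clean diacritized text] step above maps plain text to fully diacritized text. One common way to obtain parallel examples for such a mapping is to strip the diacritics from clean Yorùbá text. The sketch below illustrates that idea with the Python standard library only; the function names `strip_diacritics` and `make_training_pair` are hypothetical, and this is not necessarily how the repository's own data-prep scripts work.

```python
import unicodedata


def strip_diacritics(text):
    """Remove tone marks and underdots by dropping Unicode combining marks."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the grave, acute and dot-below marks.
    base_only = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", base_only)


def make_training_pair(diacritized_text):
    """Return (source, target): undiacritized input and clean diacritized output."""
    target = unicodedata.normalize("NFC", diacritized_text)
    source = strip_diacritics(target)
    return source, target


# Illustrative usage on a single word; a real corpus would be processed line by line.
source, target = make_training_pair("Yorùbá")
```

Normalizing the target to NFC keeps the diacritized side in one consistent encoding, which matters because the same Yorùbá character can be written either precomposed or as a base letter plus combining marks.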

Pretrained ADR Models

Datasets

https://github.com/Niger-Volta-LTI/yoruba-text

Train a Yorùbá ADR model

Dependencies

  • Python3 (tested on 3.5, 3.6, 3.7)
  • Install all dependencies: pip3 install -r requirements.txt
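
Since the project lists only Python 3.5, 3.6 and 3.7 as tested, it can help to check the interpreter before installing anything. The snippet below is a minimal sketch of such a check; the helper name `is_tested_python` is an assumption for illustration and is not part of the repository.

```python
import sys

# Versions the README lists as tested.
TESTED_VERSIONS = {(3, 5), (3, 6), (3, 7)}


def is_tested_python(version_info=None):
    """Return True if the (major, minor) version is one the project was tested on."""
    if version_info is None:
        version_info = sys.version_info
    return (version_info[0], version_info[1]) in TESTED_VERSIONS


if not is_tested_python():
    # Untested does not mean broken, so this only warns.
    sys.stderr.write(
        "Warning: Python %d.%d is outside the tested versions (3.5, 3.6, 3.7)\n"
        % (sys.version_info[0], sys.version_info[1])
    )
```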

We train models on an Amazon EC2 p2.xlarge instance running the Deep Learning AMI (Ubuntu) Version 5.0 (ami-c27af5ba). This Amazon Machine Image (AMI) comes with Python3, PyTorch, and CUDA pre-installed for training on the GPU. We use the OpenNMT-py framework for both training and restoration.

  • To install PyTorch 0.4 manually, follow the PyTorch installation instructions for your combination of OS, package manager, Python version, and CUDA version

  • git clone https://github.com/Niger-Volta-LTI/yoruba-adr.git

  • git clone https://github.com/Niger-Volta-LTI/yoruba-text.git

  • NLTK needs one extra step on a fresh install. If you see the error below, download the punkt tokenizer data as it instructs:

     Resource punkt not found.
     Please use the NLTK Downloader to obtain the resource:
    
     >>> import nltk
     >>> nltk.download('punkt')
    

Training an ADR sequence-to-sequence model

To start data preparation and training of the Bahdanau-style soft-attention model, run ./01_run_training.sh from the top-level directory. To train the Transformer model instead, run ./01_run_training_transformer.sh.
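
For readers unfamiliar with the term, Bahdanau-style (additive) soft attention scores each encoder state against the current decoder state, normalizes the scores with a softmax, and uses the weighted sum of encoder states as a context vector. The sketch below shows that computation in plain Python. It is not the OpenNMT-py implementation that the training scripts rely on, and the function names, weight matrices and toy vectors are illustrative assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_attention(query, keys, W_q, W_k, v):
    """Additive attention: score_i = v . tanh(W_q q + W_k h_i).

    query: decoder state; keys: list of encoder states.
    Returns (attention weights, context vector).
    """
    q_proj = matvec(W_q, query)
    scores = []
    for h in keys:
        k_proj = matvec(W_k, h)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Softmax over the scores (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states.
    dim = len(keys[0])
    context = [sum(w * h[d] for w, h in zip(weights, keys)) for d in range(dim)]
    return weights, context


# Toy example with 2-dimensional states and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(
    query=[0.5, -0.2],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    W_q=identity,
    W_k=identity,
    v=[1.0, 1.0],
)
```

The weights are "soft" in the sense that every input position receives a non-zero share, so the model can draw on several characters or words of context when choosing a diacritic.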

Learn more

About

License: MIT License


Languages

  • Python 78.9%
  • Jupyter Notebook 15.6%
  • Shell 5.4%