BengaliAI / gpun

Grapheme Parser and unicode normalization work

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

gpun

Grapheme Parser and unicode normalization work

Environment Setup

python requirements

  • pip requirements: pip install -r requirements.txt

Its better to use a virtual environment OR use conda-

  • conda: use environment.yml: conda env create -f environment.yml

LOCAL ENVIRONMENT: Experimentation Environment

OS          : Ubuntu 20.04.3 LTS       
Memory      : 23.4 GiB 
Processor   : Intel® Corei5-8250U CPU @ 1.60GHz × 8    
Graphics    : Intel® UHD Graphics 620 (Kabylake GT2)  
Gnome       : 3.36.8

Batch Execution

  • run script.sh (change execution mode if needed: sudo chmod +x script.sh)
  • set the DATA_DIR variable to a location to save the processed data

Separate Execution

  • python download.py -h
usage: Oscar Data Download script [-h] [--chunk_size CHUNK_SIZE] data_dir

positional arguments:
  data_dir              Path to save the oscar data as csv

optional arguments:
  -h, --help            show this help message and exit
  --chunk_size CHUNK_SIZE
                        number of data to store in one csv : default=50000
  • python words.py -h
usage: Oscar Data Word conversion script [-h] [--chunk_size CHUNK_SIZE] data_dir

positional arguments:
  data_dir              Path where oscar data is saved as csv: The data folder should contain sub-folders as described in readme

optional arguments:
  -h, --help            show this help message and exit
  --chunk_size CHUNK_SIZE
                        number of data to tokenize in a chunk : default=5000

Adding a new language

  • change languages.py with the
    • oscar language code as a new key and
    • regex pattern of unicode blocks for the language as value:
#---------------------------------------------------------------
# language
#---------------------------------------------------------------
languages={}
languages["bn"]=u'[\u0980-\u09FF]+'
languages["hi"]=u'[\u0900-\u097F]+'
languages["ml"]=u'[\u0D00-\u0D7F]+'
languages["gu"]=u'[\u0A80-\u0AFF]+'
languages["ta"]=u'[\u0B80\u0BFF]+'
languages["pa"]=u'[\u0A00\u0A7F]+'
languages["or"]=u'[\u0B00-\u0B7F]+'

About

Grapheme Parser and unicode normalization work


Languages

Language:Python 98.1%Language:Shell 1.9%