congphuc / viet-orthographic-normalization

Vietnamese orthographic normalization for preprocessing.

viet-orthographic-normalization

Vietnamese Orthographic Variation Dictionary

The file variant_dic/variants_without_wrong.tsv is a Vietnamese orthographic variation dictionary. I extracted it from VNTQcorpus(big).txt using the Python script pick_variant_from_corpus.py.

The file format is:

  • Odd-numbered columns: a syllable
  • Even-numbered columns: the corpus frequency of the syllable to its left
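
As a rough illustration of reading this format, the sketch below assumes that every line of the TSV holds tab-separated fields alternating between a syllable and the corpus frequency of the syllable to its left, and that all syllables on one line are variants of each other. The function names (`parse_variant_line`, `build_normalization_map`), the choice of the most frequent variant as the canonical form, and the sample line are hypothetical; they are not taken from the repository's scripts or data.

```python
from typing import Dict, Iterable, List, Tuple


def parse_variant_line(line: str) -> List[Tuple[str, int]]:
    """Split one TSV line into (syllable, frequency) pairs."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) % 2 != 0:
        raise ValueError(f"expected an even number of fields, got {len(fields)}")
    # Assumed layout: syllable, frequency, syllable, frequency, ...
    return [(fields[i], int(fields[i + 1])) for i in range(0, len(fields), 2)]


def build_normalization_map(lines: Iterable[str]) -> Dict[str, str]:
    """Map every variant to the most frequent variant on its line.

    Picking the most frequent form is an assumption made for this sketch.
    """
    mapping: Dict[str, str] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        pairs = parse_variant_line(line)
        canonical = max(pairs, key=lambda pair: pair[1])[0]
        for syllable, _ in pairs:
            mapping[syllable] = canonical
    return mapping


# Made-up sample line for illustration only (not from the real dictionary).
sample_lines = ["hoà\t120\thòa\t80\n"]
normalization_map = build_normalization_map(sample_lines)
```

To try this on the real dictionary, the lines could come from `open("variant_dic/variants_without_wrong.tsv", encoding="utf-8")`, provided the file actually follows the layout assumed above.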

Acknowledgement

Some of the scripts are borrowed from Luu Tuan Anh.

About

License: MIT License


Languages

Language: Python 100.0%