c-robin / transliteration

Automatically exported from code.google.com/p/transliteration

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

DSS project

The goal of this project is to be able to automatically transliterate words from one language to another. In particular, we focus on the Spanish-Portuguese and English-Russian language pairs. A first approach that we are trying is to use substitution rules, of the form: ción -> ção.

A definition of transliteration: http://en.wikipedia.org/wiki/Transliteration.

Installation

You need to install some stuff:

Python 3:

sudo apt-get install python3

Levenshtein library for python 3 (https://pypi.python.org/pypi/python-Levenshtein/0.11.1/):

cd python-Levenshtein-0.11.1
sudo python3 setup.py install

Run the code

To generate rules (please don't overwrite rules.txt):

./rules.py confidence support length rules2.txt

To test these rules:

cat rules2.txt | ./model.py

Or directly:

./rules.py confidence support length | ./model.py

Baseline, with only a few rules that were manually crafted:

cat rules.txt | ./model.py

##Other leads

Use giza++ (http://sourceforge.net/projects/mgizapp/) to align the letters. Then use a HMM model or statistical translation to do the transliteration. Learn a transducer, using a RPNI-like algorithm (e.g. OSTIA).

##Bibliography

A few (seemingly) interesting reads:

Kevin Knight and Jonathan Graehl. Machine transliteration. Computational Linguistics, 1998. (About Japanese to English phonetic transcription, not really transliteration.)

Yaser Al-Onaizan and Kevin Knight. Machine Transliteration of Names in Arabic Text. SEMITIC '02, 2002. (More interesting, because it is a spelling-based approach.)

Jong-Hoon Oh, Key-Sun Choi, and Hitoshi Isahara. A comparison of different machine transliteration models. J. Artif. Int. Res., 2006.

Anil Kumar Singh, Sethuramalingam Subramaniam, and Taraka Rama. Transliteration as alignment vs. transliteration as generation for crosslingual information retrieval. TAL, 2010.

About

Automatically exported from code.google.com/p/transliteration


Languages

Language:TeX 56.1%Language:Perl 23.9%Language:Python 17.2%Language:Shell 2.2%Language:Makefile 0.6%