jozefchmelar / Diacritics

Diacritics reconstruction (restoration) for Slovak text. Bachelor's thesis

Home Page: http://diakritika.fri.uniza.sk/

Diacritics

What it is

Diacritics reconstruction (restoration) for Slovak text, based on finding the best match among n-grams (an n-gram is a group of n words that commonly occur together in a language). This program was created for a Bachelor's thesis at the Faculty of Management Science and Informatics, University of Žilina.

How it works

The program uses data from the Slovak National Corpus of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. We used the language corpus prim-8.0-public-all, which consists of 1.5 billion tokens (namely the subcorpora of 4-grams, 3-grams, 2-grams and words). You can find them all here.

The algorithm reconstructs every word separately. It uses a trie for fast access to the list of candidate n-grams for each word without diacritics. That list contains only n-grams that include the word. The list is grouped by n (from 4-grams down to 1-grams) and, within each group, sorted by absolute frequency in the language. The n-grams are then compared, one by one, with the word and its surrounding words until a match is found. The word is then replaced with the diacritic form from the matching n-gram.

For more information in Slovak, see the bachelor's thesis: Automatická rekonštrukcia diakritiky pre slovenčinu
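The lookup described above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's C# code: the names `Trie`, `build_index`, `restore` and `strip_diacritics` are hypothetical, the n-gram counts in `TOY_NGRAMS` are invented for the example, and tokenization is simplified to lowercase words split on whitespace.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'súd' -> 'sud'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class Trie:
    """Character trie keyed by the non-diacritic form of a word."""

    def __init__(self):
        self.children = {}
        self.ngrams = []  # entries: (n, count, words, position of key word)

    def insert(self, key, entry):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.ngrams.append(entry)

    def lookup(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.ngrams


def build_index(ngram_counts):
    """Index every n-gram under each of its words (diacritics stripped)."""
    trie = Trie()
    for words, count in ngram_counts.items():
        for pos, word in enumerate(words):
            trie.insert(strip_diacritics(word), (len(words), count, words, pos))
    # Order each list: longer n-grams first, then higher frequency first.
    stack = [trie]
    while stack:
        node = stack.pop()
        node.ngrams.sort(key=lambda e: (-e[0], -e[1]))
        stack.extend(node.children.values())
    return trie


def restore(text, trie):
    """Restore diacritics word by word using the first matching n-gram."""
    plain = [strip_diacritics(t) for t in text.lower().split()]
    result = []
    for i, word in enumerate(plain):
        replacement = word  # fall back to the input word if nothing matches
        for n, _count, words, pos in trie.lookup(word):
            start = i - pos
            if start < 0 or start + n > len(plain):
                continue  # the n-gram would not fit around this word
            window = plain[start:start + n]
            if [strip_diacritics(w) for w in words] == window:
                replacement = words[pos]
                break
        result.append(replacement)
    return " ".join(result)


# Invented toy counts, for illustration only (not taken from the corpus).
TOY_NGRAMS = {
    ("krajský", "súd"): 7,
    ("sud", "piva"): 3,
    ("súd",): 100,
    ("sud",): 10,
}

if __name__ == "__main__":
    trie = build_index(TOY_NGRAMS)
    print(restore("krajsky sud rozhodol", trie))
    print(restore("sud piva", trie))
```

In this toy data the word "sud" is ambiguous ("sud" or "súd"). The surrounding words select the bigram that matches, and the most frequent 1-gram is used only when no longer n-gram fits.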

Used technologies

Final software

There are two final versions of the program. The first is faster (0.4 ms per word), uses RAM only, and has a success rate of 98.07%. The second is slower (4 ms per word), uses the hard disk, and has a success rate of 98.17%. Here you will find:

  • DLL ready to use
  • Simple website for easy, user-friendly interaction with the program

Try it here

diakritika.fri.uniza.sk

To run it you need to download these files:

https://www.dropbox.com/s/7uraxif4ocfay8k/diacritics-reconstructor-necessary-files.zip?dl=0

Languages

C# 83.4%, HTML 15.8%, CSS 0.7%, JavaScript 0.1%