Belval / TextRecognitionDataGenerator

A synthetic data generator for text recognition

Arabic Alphabet shapes

Mahmuod1 opened this issue · comments

First of all, I would like to thank you for this awesome project.
When I use the arabic_reshaper package while generating text, to fix the Arabic shapes issue, it introduces another issue: arabic_reshaper produces new shapes by combining characters.
For example, reshaping the two characters (لا) with arabic_reshaper produces the single character (ﻻ). They look the same, but the first, which is the correct one, is two characters, while the one produced by arabic_reshaper is a single Unicode character.
So they look identical in a UI, but they are actually different.
I suggest solving this small issue by reshaping only the Arabic text that is drawn in the image, and leaving the text written to the labels.txt file unreshaped.
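
A minimal sketch of this suggestion, assuming the generator holds its strings in a list before drawing. The names `split_render_and_label` and `toy_shape` are hypothetical and are not part of TextRecognitionDataGenerator. `toy_shape` only stands in for a real shaper and handles nothing but the lam-alef case.

```python
from typing import Callable, List, Tuple

LAM_ALEF = "\u0644\u0627"      # two code points: lam + alef
LAM_ALEF_LIGATURE = "\ufefb"   # one code point: isolated lam-alef presentation form


def toy_shape(text: str) -> str:
    """Hypothetical stand-in for a real reshaper: only merges lam + alef."""
    return text.replace(LAM_ALEF, LAM_ALEF_LIGATURE)


def split_render_and_label(
    strings: List[str], shape: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Return (text_to_draw, text_for_label) pairs.

    The drawn text is reshaped; the label keeps the original code points.
    """
    return [(shape(s), s) for s in strings]


# "salam": seen + lam + alef + meem
pairs = split_render_and_label(["\u0633\u0644\u0627\u0645"], toy_shape)
drawn, label = pairs[0]
print(len(drawn), len(label))
```

In a real pipeline, a function such as `arabic_reshaper.reshape` could presumably be passed as `shape` instead of the toy version.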

Hi!

Arabic is always difficult for me because I can't easily tell what would be considered good or bad for a native reader. Can you provide an example that I could add to the test suite? I am not opposed to the idea of either making this the default or adding an option for it, but I would need a test case as I hardly can tell the difference.

Thanks!

Arabic has a base set of separate characters (ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىيٱٷٹٺٻپٿڀڃڄچڇڈڌڍڎڑژڤڦکڭگڱڳںڻھۀہۅۆۇۈۉۋیېےۓ); you can get them from the arabic_reshaper package. As I said, the Arabic text is valid up to the reshaping step. So, before reshaping, I add a new list copied from the strings list. This copy is written to the labels.txt file, while the main strings list is drawn in the image, and this fixes the issue.

Arabic text is valid if all of its characters are separate alphabet characters like those in the list above.
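
One way to check this on label strings, as a rough sketch: instead of testing membership in the list above, it flags any character that falls inside the Unicode Arabic Presentation Forms blocks, which is where reshaped characters live. The function name `has_presentation_forms` is hypothetical.

```python
def has_presentation_forms(text: str) -> bool:
    """True if any character lies in Arabic Presentation Forms-A or -B.

    Forms-A is U+FB50..U+FDFF and Forms-B is U+FE70..U+FEFF.
    Note that the Forms-B block also contains U+FEFF (the byte order mark).
    """
    return any(
        0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFF
        for ch in text
    )


print(has_presentation_forms("\u0644\u0627"))  # lam + alef, unshaped
print(has_presentation_forms("\ufefb"))        # reshaped lam-alef ligature
```

A label that makes this function return `True` was probably written after reshaping rather than before.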

commented

@Belval, a quick explanation:

arabic-reshaper takes Arabic Unicode characters and converts them into equivalent Arabic presentation forms. Not all Arabic script languages are supported, i.e. not all Arabic characters have presentation forms. Presentation forms exist for backwards compatibility with older standards. So in the case of the Lam-Alef ligature, U+0644 followed by U+0627 should be displayed as a ligature in certain calligraphic traditions (this is handled by font rendering). The arabic-reshaper module converts U+0644 U+0627 to U+FEFB, the equivalent presentation form of the ligature glyph.
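
The difference between the two encodings can be seen with Python's standard `unicodedata` module alone. This is only an illustration of the code points involved, not of what arabic-reshaper does internally. It also shows that NFKC normalization maps the presentation form back to the two-character sequence.

```python
import unicodedata

logical = "\u0644\u0627"   # lam followed by alef, as typed
shaped = "\ufefb"          # single presentation-form code point

print(len(logical), len(shaped))
print(unicodedata.name(shaped))

# Compatibility normalization decomposes the presentation form again.
recovered = unicodedata.normalize("NFKC", shaped)
print(recovered == logical)
```

So if a labels file has already been written with reshaped text, NFKC normalization may be enough to recover the logical characters for this ligature.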

python-bidi takes a logically ordered string and reorders it so the string is visually ordered.

Essentially it's a hack to display Hebrew and certain Arabic script languages in non-Unicode compliant environments or modules.

It cannot support all languages that are written in the Arabic script, and it cannot support all the other RTL scripts encoded in Unicode that require complex rendering.

[ug] Uyghur, which is also written in the Arabic script, has a similar situation.

In our writing, each word is built by choosing a different form of each character depending on where it appears, and these forms are combined to produce the final word.

image

for example:
Take adem and papa: the character a (array[1]) uses a different Unicode code point in each. adem uses the first one and papa uses the last one.
adem -> ad -> [array[0][first] + array[1][first]] + d
papa -> pa -> p + [array[1][last]]
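
To illustrate the idea of one letter having several position-dependent code points, here is a small sketch using the Arabic letter beh (U+0628). The choice of beh is only an assumption for illustration; it is not the Uyghur `array` from the example above. The standard `unicodedata` module reports which base letter each presentation form decomposes to.

```python
import unicodedata

# Arabic Presentation Forms-B code points for the letter beh (U+0628).
BEH_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

for ch in BEH_FORMS:
    # decomposition() returns a string such as "<initial> 0628"
    tag, base = unicodedata.decomposition(ch).split()
    print(f"U+{ord(ch):04X} {tag} -> U+{base}")
```

All four forms map back to the same base letter, which is why a label should store the base letter rather than any one positional form.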

The [l + a] combination also behaves the same way as in Arabic:
image