List of languages in development


rkcosmos opened this issue 4 years ago · comments

Rakpong Kittinaradorn commented 4 years ago

I will update/edit this issue to track the development of new languages. The current list is:

Group 1 (Arabic script)

Arabic (DONE, August 5, 2020)
Uyghur (DONE, August 5, 2020)
Persian (DONE, August 5, 2020)
Urdu (DONE, August 5, 2020)

Group 2 (Latin script)

Serbian-latin (DONE, July 12, 2020)
Occitan (DONE, July 12, 2020)

Group 3 (Devanagari)

Hindi (DONE, July 24, 2020)
Marathi (DONE, July 24, 2020)
Nepali (DONE, July 24, 2020)
Rajasthani (NEED HELP)
Awadhi, Haryanvi, Sanskrit (if possible)

Group 4 (Cyrillic script)

Russian (DONE, July 29, 2020)
Serbian-cyrillic (DONE, July 29, 2020)
Bulgarian (DONE, July 29, 2020)
Ukrainian (DONE, July 29, 2020)
Mongolian (DONE, July 29, 2020)
Belarusian (DONE, July 29, 2020)
Tajik (DONE, April 20, 2021)
Kyrgyz (NEED HELP)

Group 5

Telugu (DONE, November 17, 2020)
Kannada (DONE, November 17, 2020)

Group 6 (Language that doesn't share characters with others)

Tamil (DONE, August 10, 2020)
Hebrew (ready to train)
Malayalam (ready to train)
Bengali + Assamese (DONE, August 23, 2020)
Punjabi (ready to train)
Abkhaz (ready to train)

Group 7 (Improvement and possible extra models)

Japanese version 2 (DONE, March 21, 2021) + vertical text
Chinese version 2 (DONE, March 21, 2021) + vertical text
Korean version 2 (DONE, March 21, 2021)
Latin version 2 (DONE, March 21, 2021)
Math + Greek?
Number+symbol only

Guideline for new language request

To request support for a new language, I need you to send a PR with the following 2 files:

In the folder easyocr/character, we need 'yourlanguagecode_char.txt', which contains a list of all characters. Please see the format/examples in the other files in that folder.
In the folder easyocr/dict, we need 'yourlanguagecode.txt', which contains a list of words in your language. On average we have ~30000 words per language, with more than 50000 words for popular ones. More is better in this file.

If your language has unique elements (such as 1. Arabic: characters change form when attached to each other, and text is written from right to left; 2. Thai: some characters need to be above the line and some below), please educate me to the best of your ability and/or give useful links. It is important to take care of the details to achieve a system that really works.

Lastly, please understand that my priority has to go to popular languages or to sets of languages that share most of their characters (also tell me if your language shares a lot of characters with others). It takes me at least a week to work on a new model. You may have to wait a while for a new model to be released.
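
Before opening a PR, it may help to check that every character used in the word list also appears in the character list. The following is only a minimal sketch: the helper name `missing_chars` and the in-memory sample data are hypothetical, and the exact layout of the real files should be checked against the existing files in easyocr/character and easyocr/dict.

```python
def missing_chars(char_text, dict_text):
    """Return characters used in the word list but absent from the character list."""
    # Ignore line breaks (and spaces in the word list) when comparing.
    allowed = set(char_text) - {"\n", "\r"}
    used = set(dict_text) - {"\n", "\r", " "}
    return sorted(used - allowed)


# Hypothetical sample data standing in for the two files' contents.
sample_chars = "a\nb\nc\n"
sample_words = "abc\ncab\nabd\n"

# 'd' appears in a word but not in the character list.
print(missing_chars(sample_chars, sample_words))
```

With real files, the two arguments would be the text read from 'yourlanguagecode_char.txt' and 'yourlanguagecode.txt' (opened with encoding="utf-8"). An empty result suggests the two files are consistent.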

Yossi Biton · Answer 1 · Fri Jul 10 2020 13:42:09 GMT+0800 (China Standard Time)

Why won't you share the training code, so people can train the model by themselves?

Valentin Malykh · Answer 2 · Fri Jul 10 2020 13:59:44 GMT+0800 (China Standard Time)

For Group 4 you could add Ukrainian, Bulgarian, and maybe Mongolian; although it is not Slavic, it uses the Cyrillic script.

Kate · Answer 3 · Fri Jul 10 2020 14:27:46 GMT+0800 (China Standard Time)

Do you plan to only work with human languages? It would be amazing to add a model to recognize mathematical formulas.

manohar-cyber · Answer 4 · Fri Jul 10 2020 19:04:45 GMT+0800 (China Standard Time)

I guess Tamil and Telugu can be added to one group because they belong to a language family called 'Dravidian', meaning they are related in terms of grammar and word order. Two other languages popular in India that belong to that family, Kannada and Malayalam, can also be added to that group (for further info: https://en.m.wikipedia.org/wiki/Dravidian_languages). Moreover, Telugu and Kannada share some common letters and words. I will be adding the alphabet and words of the Kannada language as a language request.
Great project, keep it up👍

b.kirilov · Answer 5 · Fri Jul 10 2020 19:58:10 GMT+0800 (China Standard Time)

For Group 4
Bulgarian
dict bg.txt
char bg_char.txt

Prakash Upadhyay · Answer 6 · Fri Jul 10 2020 20:13:34 GMT+0800 (China Standard Time)

I'd highly recommend supporting the Devanagari script (Wiki - https://en.wikipedia.org/wiki/Devanagari), which is the fourth most widely adopted writing system in the world. Please go through the Wikipedia link to understand its widespread usage across languages including Sanskrit, Hindi, Marathi, Awadhi, and Haryanvi.

I see you have included "Hindi" as a target language, which of course, is the most spoken language in the Indian Subcontinent.

If you could let me know what's the current word-count you have (maybe share the "dict" & "alphabets" directory), I can continue with the research to share more details about the Language as it's my First Language.

Hindi has 47 primary alphabets (including 14 Vowels & 33 Consonants).

You can contact me @ prakash.upadhyay93@yahoo.com

Arash Jafari · Answer 7 · Fri Jul 10 2020 23:01:32 GMT+0800 (China Standard Time)

Can I help with the Persian (Farsi) language? I can supply some popular words and characters.

@rkcosmos

Junaid Girkar · Answer 8 · Fri Jul 10 2020 23:55:11 GMT+0800 (China Standard Time)

Can I contribute in any way? I am fluent in Hindi as well as English. I may also be of help on the programming side: I know Python, C and Java, and I am good at front-end work with HTML, CSS and JavaScript (basic).

Manmeet Singh · Answer 9 · Sat Jul 11 2020 02:43:29 GMT+0800 (China Standard Time)

I recommend adding Punjabi language which is the 10th most spoken language around the world.
pb_char.txt

Rakpong Kittinaradorn · Answer 10 · Sat Jul 11 2020 15:42:25 GMT+0800 (China Standard Time)

@edloginova After doing human language, we can explore math as well.

@upadhyayprakash Lists are here easyocr/character and easyocr/dict

@arashjafari looks like we already have both words and char. You can recheck if everything is alright.

@junaidgirkar sounds good, I'll keep in mind. May call you for help.

Vijayabhaskar · Answer 11 · Sun Jul 12 2020 00:01:31 GMT+0800 (China Standard Time)

Why won't you share the training code, so people can train the model by themselves?

I agree with this. If the training code and a sample dataset are provided, many people can train the model for their language. With free GPU services like Google Colab and Kaggle Kernels, anyone can train models online and contribute much faster.

Rakpong Kittinaradorn · Answer 12 · Sun Jul 12 2020 08:03:58 GMT+0800 (China Standard Time)

Why won't you share the training code, so people can train the model by themselves?

@yossibiton @Vijayabhaskar96 because the training process is still not straightforward. Even I have to think carefully when creating a model. I will share it later when it's clean. Please don't pressure me, I am doing a lot of work for free.

Rahil Wazir · Answer 13 · Sun Jul 12 2020 09:59:11 GMT+0800 (China Standard Time)

@rkcosmos Can we add support for Urdu? It is very similar to Persian and Arabic (without many of the complexities of Arabic, though).

Vijayabhaskar · Answer 14 · Sun Jul 12 2020 11:47:47 GMT+0800 (China Standard Time)

Why won't you share the training code, so people can train the model by themselves?

@yossibiton @Vijayabhaskar96 because the training process is still not straightforward. Even I have to think carefully when creating a model. I will share it later when it's clean. Please don't pressure me, I am doing a lot of work for free.

Sorry I made you feel this way, I didn't mean to pressure you. I just wanted to help. Take your time, you're doing great work!

Faisal Nasim · Answer 15 · Sun Jul 12 2020 13:00:33 GMT+0800 (China Standard Time)

@rkcosmos For Group 1, could you please add Urdu to that group? Urdu is very similar to Arabic and Persian and I've just submitted the PR for the character list and a dictionary. So it should be ready to go!

cc: @rahilwazir

Loay · Answer 16 · Mon Jul 13 2020 05:53:04 GMT+0800 (China Standard Time)

This might help for Arabic:

https://github.com/OSINTAI/Arabic_Words

Sumitkumar Sarda · Answer 17 · Mon Jul 13 2020 23:47:06 GMT+0800 (China Standard Time)

i added Marathi character and dictionary data set file please train it
mr.txt

Rakpong Kittinaradorn · Answer 18 · Tue Jul 14 2020 09:37:32 GMT+0800 (China Standard Time)

i added Marathi character and dictionary data set file please train it
mr.txt

@sardasumit did you forget a link for mr_char.txt?

Sumitkumar Sarda · Answer 19 · Tue Jul 14 2020 09:51:59 GMT+0800 (China Standard Time)

i added Marathi character and dictionary data set file please train it
mr.txt

@sardasumit did you forget a link for mr_char.txt?

@rkcosmos it is same like Hindi character
mr_char.txt

Nishad Thalhath · Answer 20 · Wed Jul 15 2020 18:35:52 GMT+0800 (China Standard Time)

@rkcosmos
Malayalam (https://en.wikipedia.org/wiki/Malayalam), belongs to Group 6.
#143
This PR contains character and word lists.

Vladimir Gurevich · Answer 21 · Thu Jul 16 2020 02:04:19 GMT+0800 (China Standard Time)

Hi! Thanks for your work. Some notes about Hebrew: some letters have final forms (that is, the letter changes its shape when it is placed at the end of a word) https://en.wikipedia.org/wiki/Final_form. There are also diacritical signs https://en.wikipedia.org/wiki/Niqqud that are used to represent vowels or to distinguish between alternative pronunciations of letters (Arabic also has final forms, among others, and diacritical signs). I didn't provide the diacritical signs; I assume it's better to train without them first (ordinary writing consists of plain letters without diacritical signs).

Vladimir Gurevich · Answer 22 · Thu Jul 16 2020 07:18:12 GMT+0800 (China Standard Time)

I remembered an important thing. In Hebrew there is a cursive script (https://en.wikipedia.org/wiki/Cursive_Hebrew), and sometimes people mix it with the usual writing, even in printed matter. It's the same letters (chars), but, let's say, in another font (e.g. https://opensiddur.org/wp-content/uploads/fonts/display-font-charmap.php?fnt=DorianCLM-Italic ). Maybe it's also better not to implement this immediately, I don't know.

Rakpong Kittinaradorn · Answer 23 · Thu Jul 16 2020 16:45:15 GMT+0800 (China Standard Time)

@nishad Malayalam and Tamil are both Dravidian but do not use the same script, so I have to build 2 models.
@imvladikon ok, I will try to keep this in mind when building the Hebrew model.

Rakpong Kittinaradorn · Answer 24 · Thu Jul 16 2020 23:41:34 GMT+0800 (China Standard Time)

Question for Indian-language speakers: I'm looking into the Hindi char and dict files, and there are a lot of chars that appear in the word list but not in the char list. Examples are
['ा', '्', 'ि', 'ी', 'ं', 'ो', 'ु', 'ँ', 'ू', 'ड़', 'ै']. What are these symbols?

Vijayabhaskar · Answer 25 · Thu Jul 16 2020 23:49:02 GMT+0800 (China Standard Time)

@rkcosmos Those are parts of the existing alphabet; when combined with a letter, they create a new letter. I think the technical term is grapheme? I'm not sure. I would like to know whether they render fine or whether something happens like it did with Tamil.

Rakpong Kittinaradorn · Answer 26 · Fri Jul 17 2020 11:19:05 GMT+0800 (China Standard Time)

@Vijayabhaskar96 So far, Devanagari doesn't have any problem. They support unicode well.

Vladimir Gurevich · Answer 27 · Fri Jul 17 2020 22:23:07 GMT+0800 (China Standard Time)

Another addition about Hebrew ;) and it's important. Some diacritic signs matter, like geresh and gershayim. A geresh with ג ז צ is used for the sounds j g, ch, which are not represented in the alphabet, and a double geresh (gershayim) is used for widespread abbreviations (kitsur); the most famous is תנ"ך (Tanakh). Sometimes people use ordinary quotation marks (apostrophes) instead of typing geresh or gershayim (e.g. תנ''ך).

omprakash-jena · Answer 28 · Sun Jul 19 2020 15:48:40 GMT+0800 (China Standard Time)

Can you please do the same for the Odia language, which is also a classical language of India and falls under Group 6?
My mail ID: jena.omprakash@gmail.com
I can provide you with datasets for the Odia language.

Rakpong Kittinaradorn · Answer 29 · Sun Jul 19 2020 15:53:47 GMT+0800 (China Standard Time)

@omprakash-jena You can create a pull request to add files. Or you can also attach files in comment here.

Rakpong Kittinaradorn · Answer 30 · Tue Jul 21 2020 13:10:17 GMT+0800 (China Standard Time)

@Vijayabhaskar96 @sardasumit @junaidgirkar @upadhyayprakash

Question for Indian-language speakers: I'm testing the Devanagari model with this image.
The result is ['50', '40', 'बसझरकर', 'SPEED', 'मािकट', 'LIMIT', 'BASRURKAR', 'MARKET'].

The problem is with मािकट. It doesn't look like what is written in the original image. But when I run
for c in 'मािकट': print(c), I get म ा ि क ट, which looks about right. What's going on here? Is it just the way Python renders Hindi?

Nishad Thalhath · Answer 31 · Tue Jul 21 2020 18:15:23 GMT+0800 (China Standard Time)

@rkcosmos

Both devanagari strings are identified differently from the image.
They are बसरूरकर and मार्किट.
These characters join (र+ ्+ क+ ि ), and renders as र्कि
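
To make the joining behaviour concrete, here is a minimal sketch that lists the code points of मार्किट using Python's standard `unicodedata` module. The word is written with escape sequences so that each code point is explicit. The string holds seven separate code points, and it is the text renderer, not Python, that draws र + ् + क + ि as a single joined shape.

```python
import unicodedata

# मार्किट spelled out code point by code point.
word = "\u092e\u093e\u0930\u094d\u0915\u093f\u091f"

# Collect the official Unicode name of every code point.
names = [unicodedata.name(ch) for ch in word]

for ch, name in zip(word, names):
    # Category Lo = letter, Mc/Mn = combining marks (vowel signs, virama).
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {name}")
```

Printing a string one character at a time, as in the question above, therefore shows the stored sequence rather than the rendered form.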

Rakpong Kittinaradorn · Answer 32 · Tue Jul 21 2020 18:34:13 GMT+0800 (China Standard Time)

@nishad Wow, this is really hard. It means the OCR needs to understand how to combine characters in a very complex way.

Nishad Thalhath · Answer 33 · Tue Jul 21 2020 18:42:03 GMT+0800 (China Standard Time)

@rkcosmos, this is complex and I am not knowledgeable in explaining this.
@santhoshtr could you please share your expertise ?

Vijayabhaskar · Answer 34 · Tue Jul 21 2020 21:31:17 GMT+0800 (China Standard Time)

@rkcosmos I don't speak Hindi, but this is interesting. I think the problem is that hi_char.txt doesn't have all the chars. For example, कि is not there, but क and ि are present, while क + ि = कि.
If you look at the Tamil chars, all the letters with those extra parts are present, instead of just the root letter and the extra parts separately; i.e. க + ா = கா, and both க and கா are present in ta_char.txt but ா is not.
I think this is the right way to add characters to the lang_char.txt file.
There are many combinations here which I think should all (plus many others) be added as unique characters, but just 85 chars exist in hi_char.txt.
I think this depends on what people call a character, but if कि exists in hi_char.txt, it will be easy for the network to simply use it instead of figuring out the order of all the parts that constitute the letter, right?

I might be wrong here about Hindi alphabets as I don't speak the language, do correct me if I'm wrong @nishad

Rakpong Kittinaradorn · Answer 35 · Tue Jul 21 2020 22:57:05 GMT+0800 (China Standard Time)

@Vijayabhaskar96 It depends on how many combinations there are for each language. It's possible to do it both ways. For example, I did Thai before with separated characters. We have something like ท + ี + ่ = ที่. The first three characters are in the list but the combined form is not. For Thai, I use separated characters because the number of all possible combinations is just too large to imagine. For Tamil, you have 325 combined forms; I think it should be doable. Now for Hindi, they have र्कि, which is a combined form of 4 separate characters! I would guess the number of all possible combined forms is extremely large, so we might have to go the separated-char way. I just hope that my current neural network's settings can learn such complexity. I will let everyone know in a few days whether it works or not.

Vijayabhaskar · Answer 36 · Tue Jul 21 2020 23:50:52 GMT+0800 (China Standard Time)

According to the link in my previous comment, there are about 500 combinations in Hindi. Even with all the combinations that aren't in that link, I don't think it will exceed 2000 characters; the Telugu list has 2000+ chars. So try what works best; findings from this experiment may simplify the training approach for other languages.
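
One way to estimate how many combined forms a word list actually contains is to split each word into clusters and count the distinct ones. The sketch below is a deliberately naive approximation, not the full Unicode grapheme cluster algorithm: it attaches combining marks (categories Mn and Mc) to the preceding letter and lets a virama pull in the following consonant. The function name `naive_clusters` and the one-word sample list are hypothetical.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA


def naive_clusters(text):
    """Split text into approximate Devanagari clusters (not full UAX #29)."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        # A mark joins the previous cluster; so does a letter right after a virama.
        joins_prev = bool(clusters) and (is_mark or clusters[-1].endswith(VIRAMA))
        if joins_prev:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Hypothetical one-word "dictionary": मार्किट
words = ["\u092e\u093e\u0930\u094d\u0915\u093f\u091f"]
combined_forms = {c for w in words for c in naive_clusters(w)}
print(len(combined_forms), "distinct clusters from", len(words), "word(s)")
```

Run over a full dictionary file, the size of `combined_forms` gives a rough lower bound on how large a combined-form character list would need to be, which can inform the choice between the two approaches discussed above.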

Vladimir Gurevich · Answer 37 · Wed Jul 22 2020 00:49:44 GMT+0800 (China Standard Time)

Is Belarusian ready? (commit) The list probably needs an update.

Rakpong Kittinaradorn · Answer 38 · Fri Jul 24 2020 10:38:12 GMT+0800 (China Standard Time)

Version 1.1.5 now supports the Devanagari script. Please test, give feedback, and spread the word to your community.

Sumitkumar Sarda · Answer 39 · Fri Jul 24 2020 10:43:18 GMT+0800 (China Standard Time)

Hello, please send me the training code and I will look into this issue. Thank you.

On Tue, 21 Jul 2020 at 7:01 pm, Vijayabhaskar wrote: @rkcosmos I don't speak Hindi, but this is interesting. I think the problem is hi_char.txt doesn't have all the chars. For example: र् is not there but र and ् are present while र + ् = र् If you look at Tamil chars all the alphabets with those extra parts are present instead of just including the root alphabet and the extra parts separately. i.e க + ா = கா, both க and கா are present in ta_char.txt but not ா

Saleh Souzanchi · Answer 40 · Sat Jul 25 2020 03:39:20 GMT+0800 (China Standard Time)

Hello,
Can I help with the Persian (Farsi) language?
I can supply some popular words and characters, and I also work in the field of Persian typeface design and libre Persian fonts.

Rakpong Kittinaradorn · Answer 41 · Sat Jul 25 2020 06:09:00 GMT+0800 (China Standard Time)

@zoghal please add more words to easyocr/character/fa.txt

Rakpong Kittinaradorn · Answer 42 · Wed Jul 29 2020 06:33:28 GMT+0800 (China Standard Time)

The Cyrillic script (Russian, Serbian (Cyrillic), Belarusian, Bulgarian, Mongolian, Ukrainian) is available for testing in the latest update (not on pip yet). Please test and give feedback.

Vladimir Gurevich · Answer 43 · Thu Jul 30 2020 04:02:01 GMT+0800 (China Standard Time)

Belarusian, Russian, Ukrainian:
https://colab.research.google.com/drive/1Sy1endzzbommR2b5pyw7YIoa8CIV-jkN?usp=sharing
The quality is not bad, but not the best: in the Colab, the wiki test for Ukrainian and Belarusian is far from the best, and the Belarusian model sometimes could not split words correctly. In Ukrainian there is an additional sign, the apostrophe ’, which marks softness or, on the contrary, emphasizes a hard consonant (as in ім'я), depending on certain phonetic rules and cases.

Rakpong Kittinaradorn · Answer 44 · Thu Jul 30 2020 08:57:43 GMT+0800 (China Standard Time)

@imvladikon Thanks for the analysis, it's very useful. I'll keep the special character issue in mind for the next fine-tuning. For low-resolution images, you might need to change some parameters to get a better result. For example, I would try mag_ratio = 1.2 (that's zooming by 20%). One thing I don't understand in your Colab is the expected result for ulica_be.jpg. It's the phrase 'лінгвістичний'. Is it a typo, or does your language have a special rule for combining characters?
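
As a minimal sketch of the mag_ratio suggestion: the wrapper name `read_low_res` and its default language list are hypothetical, while `easyocr.Reader` and the `mag_ratio` argument of `readtext` come from EasyOCR itself. The import is done lazily so the function can be defined without the library installed.

```python
def read_low_res(image_path, langs=("be",), mag_ratio=1.2):
    """Run EasyOCR on an image after magnifying it slightly."""
    import easyocr  # imported lazily; requires `pip install easyocr`

    reader = easyocr.Reader(list(langs))
    # mag_ratio > 1 enlarges the image before detection (1.2 is about 20% zoom).
    return reader.readtext(image_path, mag_ratio=mag_ratio)
```

For example, `read_low_res('ulica_be.jpg')` would rerun the Belarusian sample with the suggested zoom; other values may suit other images better.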

Vladimir Gurevich · Answer 45 · Thu Jul 30 2020 14:21:04 GMT+0800 (China Standard Time)

@rkcosmos yeah, it's a typo) for the Belarusian language should be "лінгвістычны", fixed it. this word accidentally was written on Ukrainian "лінгвістичний" ("linguistic")

Rakpong Kittinaradorn · Answer 46 · Tue Aug 04 2020 12:40:25 GMT+0800 (China Standard Time)

Arabic is now supported in v1.1.6. To use the latest version, you need to uninstall first
pip uninstall easyocr

and install directly from the source code
pip install git+https://github.com/jaidedai/easyocr.git

Please try it and give feedback.

Vijayabhaskar · Answer 47 · Tue Aug 04 2020 12:43:40 GMT+0800 (China Standard Time)

You can do that in a single command.
pip install git+https://github.com/jaidedai/easyocr.git --upgrade
There is no need to uninstall manually.

Vijayabhaskar · Answer 48 · Tue Aug 11 2020 03:09:59 GMT+0800 (China Standard Time)

@rkcosmos Thanks for adding support for Tamil. I tested a few images and it worked well; some images required fiddling with the args for better results, but anyway, great job overcoming the Unicode issues. May I know what you did? And if you used pyvips as I suggested, how did it go? Can you explain briefly?

Faisal Nasim · Answer 49 · Tue Aug 11 2020 03:12:12 GMT+0800 (China Standard Time)

@rkcosmos You marked Urdu as done (great news!). Can I try it in the latest?

Rahil Wazir · Answer 50 · Tue Aug 11 2020 03:42:42 GMT+0800 (China Standard Time)

@fnasim @rkcosmos I've been testing Urdu with multiple variations. It's not completely there yet, but it's close. There are still some subtle differences.

The بھائی is recognized as میں ,بجال is recognized as لکھیں ,ئیں recognized as اکھیں etc.
I will continue to test more and share the results.

Faisal Nasim · Answer 51 · Tue Aug 11 2020 03:58:51 GMT+0800 (China Standard Time)

@rahilwazir Thank you for testing! Please keep a list of discrepancies and always keep track of the image you used (minimal is better). I tried a screen grab from bbc.com/urdu and it worked pretty well (100% accuracy based on quick visual inspection - GREAT start, kudos @rkcosmos ).

Faisal Nasim · Answer 52 · Sat Aug 15 2020 15:22:55 GMT+0800 (China Standard Time)

@rahilwazir here's the test I did - https://codepen.io/fnasim/pen/OJNNmPg. I've embedded the image and output json from easyocr on there so just hit 'Process' to see rectangles. Please feel free to click them.

@rkcosmos the OCR is pretty accurate but the confidence score is extremely low. Do you know why?

ai-motive · Answer 53 · Wed Sep 02 2020 09:20:25 GMT+0800 (China Standard Time)

If the math model is updated, what is the output of the input image?

==> ?

NeighborhoodCoding · Answer 54 · Wed Sep 02 2020 12:48:46 GMT+0800 (China Standard Time)

Can I run in Android via TF-lite?

Thamme Gowda · Answer 55 · Wed Sep 16 2020 11:37:34 GMT+0800 (China Standard Time)

@rkcosmos Thanks for this awesome work.
I see that in Group 5, Kannada is listed as "NEED HELP"
NOTE: I'd already sent PR with Kannada vocabulary #124

we appreciate it if you could please move Kannada up in your priority.
Is there anything I can help you to speed up training the first version of OCR model for Kannada? Thx

Danial Zakaria · Answer 56 · Tue Oct 20 2020 05:44:31 GMT+0800 (China Standard Time)

@rkcosmos Any update on this work? It's been a while since I have added the Abkhazian language!

Haowen Jiang · Answer 57 · Tue Nov 10 2020 16:56:10 GMT+0800 (China Standard Time)

I noticed there's no date for Group 7. I'm just wondering when "Chinese version2 + vertical text" will be released. The current Chinese model is excellent in many ways, but doesn't work for vertical texts. Thanks for keeping improving!

Rakpong Kittinaradorn · Answer 58 · Tue Nov 10 2020 17:37:10 GMT+0800 (China Standard Time)

Hi everyone, sorry for the delay. We have been testing a smaller network for faster processing time. Apart from this, we also need to do things that can feed ourselves. I can't promise any timeline, but we are working on it.

btw @thammegowda Kannada and Telugu characters may look similar, but they use different Unicode character sets. I may have to separate them into 2 models.

SE+AI · Answer 59 · Tue Dec 01 2020 09:55:18 GMT+0800 (China Standard Time)

What a great to-do list! Any plans or timelines for supporting the Myanmar/Burmese language?
It is similar to Thai, so I think it belongs to Group 6 (Language that doesn't share characters with others).

Anass Kartit · Answer 60 · Fri Dec 18 2020 00:26:06 GMT+0800 (China Standard Time)

any plans for french? how to help?

alirezamohammadi01 · Answer 61 · Sat Jan 23 2021 04:30:27 GMT+0800 (China Standard Time)

hi how can i use Persian language ???
reader = easyocr.Reader(['ch_sim','fa'])
result = reader.readtext('chinese.jpg')
why not work?
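
A possible explanation, offered as an assumption rather than a confirmed answer from the maintainer: EasyOCR restricts which languages can share one Reader, and ch_sim is documented as compatible only with English, so combining it with fa is rejected. A minimal sketch of a workaround is to build one reader per script. The names `PERSIAN_LANGS`, `CHINESE_LANGS` and `make_readers` are hypothetical.

```python
# Language lists that stay within one script family plus English.
PERSIAN_LANGS = ["fa", "en"]
CHINESE_LANGS = ["ch_sim", "en"]


def make_readers():
    """Create one EasyOCR reader per script instead of mixing them."""
    import easyocr  # imported lazily; requires `pip install easyocr`

    persian_reader = easyocr.Reader(PERSIAN_LANGS)
    chinese_reader = easyocr.Reader(CHINESE_LANGS)
    return persian_reader, chinese_reader
```

Each reader can then be applied to the images written in its own script, for example `persian_reader.readtext('persian.jpg')` with a Persian image of your own.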

Olar Alex · Answer 62 · Tue Feb 02 2021 19:07:00 GMT+0800 (China Standard Time)

Even if it is messy, I think a lot of us would be really happy if you shared your training code as well. Would it be possible in any way? Thanks in advance.

KhayrulloevDD · Answer 63 · Tue Mar 30 2021 13:03:44 GMT+0800 (China Standard Time)

For Group 4
Tajik
tajik.zip

KhayrulloevDD · Answer 64 · Thu Apr 01 2021 16:57:01 GMT+0800 (China Standard Time)

Hello there! Can you give me some feedback, so I know whether I sent you the correct data for training the Tajik language? Thank you!

tsaidevin · Answer 65 · Thu Apr 15 2021 16:40:58 GMT+0800 (China Standard Time)

Excuse me, will there be a chi_tra version 2?

Rakpong Kittinaradorn · Answer 66 · Thu Apr 15 2021 18:46:29 GMT+0800 (China Standard Time)

@tsaidevin yes.

Abhishek Verma · Answer 67 · Tue Apr 20 2021 11:19:17 GMT+0800 (China Standard Time)

Are the model training scripts not available? If somebody wants to train on a new language, how can they contribute to improving the model?

DaniSubodh · Answer 68 · Thu Apr 22 2021 02:04:38 GMT+0800 (China Standard Time)

Hi @rkcosmos, really impressive work with the OCR framework.
I couldn't find a code for Greek language here, but I see (Greek + math) in the development list above. Do you know if the Greek language itself will be supported anytime soon?

Phon Vanna · Answer 69 · Fri May 14 2021 22:38:47 GMT+0800 (China Standard Time)

Hi @rkcosmos , really impressive work.
I am not sure whether you will add khmer language to it?

Àlex Solé Gómez · Answer 70 · Mon Jun 14 2021 18:20:20 GMT+0800 (China Standard Time)

Hi @rkcosmos,

We created a dictionary and dict_char for the greek language, you can find it here

Karma Wangchuk · Answer 71 · Sun Aug 01 2021 13:09:07 GMT+0800 (China Standard Time)

Can you add Dzongkha? Dzongkha is the national language of Bhutan and it is similar to the Tibetan Language. Similar to Thai language, it is written continuously from left to right and does not have a whitespace between words. Following paper discusses on next syllabus prediction for Dzongkha. https://doi.org/10.1016/j.jksuci.2021.01.001

weihaulee · Answer 72 · Mon Sep 06 2021 11:57:38 GMT+0800 (China Standard Time)

Hi @rkcosmos,
Is there any plan to add English (vertical text) and digits (vertical text)?

Thanks in advance.

vneseresearcher · Answer 73 · Wed Sep 29 2021 10:04:19 GMT+0800 (China Standard Time)

Hi @rkcosmos , Can you share the Japanese dataset you used to train? Thanks a lot!

Ajeet Mishra · Answer 74 · Mon Oct 18 2021 02:27:18 GMT+0800 (China Standard Time)

Hi @rkcosmos. Thank you very much for the efforts you are taking. Is there a plan to include the Indian languages - Gujarati and Oriya ?

amirashe · Answer 75 · Thu Nov 11 2021 00:42:30 GMT+0800 (China Standard Time)

Hi @rkcosmos, do you know if Hebrew will be ready soon? Thanks a lot!

Fatemeh sadat Hosseini · Answer 76 · Mon Dec 06 2021 00:36:31 GMT+0800 (China Standard Time)

Hello, @rkcosmos thank you for your great job.
I ran version 1.4.1 for Persian (Farsi), but the problem I am facing is that Persian is written and read from right to left, while the model detects words from left to right. This confuses text detection and reduces the accuracy.
Is there any way to fix this issue?
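
If the issue is mainly the order in which detected words are returned, one possible post-processing step is to re-sort the results right to left within each line. This is only a sketch under assumptions: it presumes results in the `(bbox, text, confidence)` shape returned by `readtext`, with `bbox` as four `[x, y]` corner points, and the function name `order_rtl` and the `line_tol` pixel tolerance are hypothetical. It does not change what the model recognizes inside each box.

```python
def order_rtl(results, line_tol=10):
    """Order (bbox, text, confidence) items top-to-bottom, then right-to-left."""

    def top(item):
        return min(point[1] for point in item[0])

    def right(item):
        return max(point[0] for point in item[0])

    # Group boxes into lines: a box joins the current line when its top edge
    # is within line_tol pixels of the first box on that line.
    lines = []
    for item in sorted(results, key=top):
        if lines and abs(top(lines[-1][0]) - top(item)) <= line_tol:
            lines[-1].append(item)
        else:
            lines.append([item])

    ordered = []
    for line in lines:
        ordered.extend(sorted(line, key=right, reverse=True))
    return ordered


# Synthetic boxes: three on the first line, one on the second.
sample = [
    ([[0, 0], [40, 0], [40, 20], [0, 20]], "C", 0.9),
    ([[50, 2], [90, 2], [90, 22], [50, 22]], "B", 0.9),
    ([[100, 1], [140, 1], [140, 21], [100, 21]], "A", 0.9),
    ([[60, 40], [140, 40], [140, 60], [60, 60]], "D", 0.9),
]
print([text for _, text, _ in order_rtl(sample)])
```

The tolerance would need tuning to the text height in real images, and skewed lines would need a more careful grouping rule.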

Bereket Abraham · Answer 77 · Thu Dec 09 2021 17:53:44 GMT+0800 (China Standard Time)

Hi @rkcosmos, thanks so much for all the work you've put in. I've included a PR for the Amharic language, which is spoken by over 60 million people.
#616

One potential issue is that Amharic words contain a number of prefixes and suffixes to indicate the object, number of items, tense, gender, negation and so on. Thus, a single verb may morph in a number of ways that are not all included in the dictionary.

Bereket Abraham · Answer 78 · Thu Dec 09 2021 17:57:05 GMT+0800 (China Standard Time)

Hi @rkcosmos, I also submitted a PR for the Tigrinya language, which is similar to Amharic and spoken by over 10 million people.
#615

It has the same mutation issue as Amharic. Also, Arabic numerals are very common despite having its own numeral system.

98-Jane · Answer 79 · Thu Feb 10 2022 12:09:01 GMT+0800 (China Standard Time)

@rkcosmos Question: Why is the Chinese dict in pinyin rather than Chinese characters? In the dict folder, I cannot find a Chinese dict that is not pinyin. How is this mapping achieved? If I want to add some words to the Chinese dict, how do I add the training data and the dict entries?

Reema Shrestha · Answer 80 · Tue Aug 02 2022 23:53:03 GMT+0800 (China Standard Time)

@rkcosmos is Greek language updated? I saw someone contributing for greek in the comment.

Saranga Kumarapeli · Answer 81 · Thu Sep 01 2022 17:36:08 GMT+0800 (China Standard Time)

does easyocr support Sinhala language?

amits-ds · Answer 82 · Mon Apr 24 2023 16:10:55 GMT+0800 (China Standard Time)

Hey, thank you for this Repo 🙏
Is there an update on a model for Hebrew OCR?

minamohamadii · Answer 83 · Tue May 02 2023 14:35:17 GMT+0800 (China Standard Time)

I want to use the Farsi language, but I see the model is fine-tuned on the Arabic digit 5 rather than the Farsi digit 5:
5 farsi = ۵
5 arabic = ٥
and it become a problem for me because it show me 0 in persian show me ٥ can you please improve farsi language ??? it help me so much
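
Until the model itself distinguishes the two digit sets, one possible workaround is to map Arabic-Indic digits (U+0660 to U+0669) to their Persian counterparts (U+06F0 to U+06F9) in the recognized text. This is a minimal post-processing sketch, not part of EasyOCR; the names `ARABIC_TO_PERSIAN_DIGITS` and `to_persian_digits` are hypothetical.

```python
# Translation table from Arabic-Indic digits 0-9 to Persian digits 0-9.
ARABIC_TO_PERSIAN_DIGITS = str.maketrans(
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
    "\u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9",
)


def to_persian_digits(text):
    """Replace Arabic-Indic digits with Persian digits; leave everything else."""
    return text.translate(ARABIC_TO_PERSIAN_DIGITS)


# Arabic five (U+0665) becomes Persian five (U+06F5).
print(to_persian_digits("\u0665"))
```

This only makes sense when the input is known to be Persian, since the same mapping would be wrong for genuine Arabic text.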

ashudhatma · Answer 84 · Tue Nov 28 2023 19:13:08 GMT+0800 (China Standard Time)

For Group 3 (Devanagari)
Request to add Gujarati Language
gu.txt
gu_char.txt

Nikos Mermigas · Answer 85 · Thu Feb 15 2024 04:56:42 GMT+0800 (China Standard Time)

Hey, I saw that the issue about the support of the greek language is completed and I can see the two required .txt documents about greek. However, I cannot find the greek language code ('gre' in the repo) in the list with the supported languages that is on the website. Is greek actually supported?

Haroon khan · Answer 86 · Mon Apr 08 2024 12:28:20 GMT+0800 (China Standard Time)

Is the Urdu language not supported by the EasyOCR model?

Aynaz Rafiei · Answer 87 · Sat May 18 2024 22:23:12 GMT+0800 (China Standard Time)

There is a mistake in the name of Group 1. It has to be Persian script. If you search, you will see that Persian is the mother language of the others, and the rest (Arabic, Urdu and Uyghur) were taken from it (the Persian language).

Bereket Abraham · Answer 88 · Tue May 21 2024 15:19:40 GMT+0800 (China Standard Time)

Please let me know if Amharic or Tigrinya can be added, thanks! @AinazRafiei
#91 (comment)
#91 (comment)

DanielVegaVega · Answer 89 · Thu Jul 04 2024 16:48:55 GMT+0800 (China Standard Time)

Hey, I saw that the issue about the support of the greek language is completed and I can see the two required .txt documents about greek. However, I cannot find the greek language code ('gre' in the repo) in the list with the supported languages that is on the website. Is greek actually supported?

@nmermigas looking for Greek as well. Could you find a way to "train" EasyOCR for it? Or is it something that the developer team must train?

Iordanis Sapidis · Answer 90 · Tue Jul 09 2024 14:36:14 GMT+0800 (China Standard Time)

I am also looking for Greek.