JaidedAI / EasyOCR

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

Home Page: https://www.jaided.ai

List of languages in development

rkcosmos opened this issue

I will update this issue to track the development of new languages. The current list is:

Group 1 (Arabic script)

  • Arabic (DONE, August 5, 2020)
  • Uyghur (DONE, August 5, 2020)
  • Persian (DONE, August 5, 2020)
  • Urdu (DONE, August 5, 2020)

Group 2 (Latin script)

  • Serbian-latin (DONE, July 12, 2020)
  • Occitan (DONE, July 12, 2020)

Group 3 (Devanagari)

  • Hindi (DONE, July 24, 2020)
  • Marathi (DONE, July 24, 2020)
  • Nepali (DONE, July 24, 2020)
  • Rajasthani (NEED HELP)
  • Awadhi, Haryanvi, Sanskrit (if possible)

Group 4 (Cyrillic script)

  • Russian (DONE, July 29, 2020)
  • Serbian-cyrillic (DONE, July 29, 2020)
  • Bulgarian (DONE, July 29, 2020)
  • Ukrainian (DONE, July 29, 2020)
  • Mongolian (DONE, July 29, 2020)
  • Belarusian (DONE, July 29, 2020)
  • Tajik (DONE, April 20, 2021)
  • Kyrgyz (NEED HELP)

Group 5

  • Telugu (DONE, November 17, 2020)
  • Kannada (DONE, November 17, 2020)

Group 6 (Languages that don't share characters with others)

  • Tamil (DONE, August 10, 2020)
  • Hebrew (ready to train)
  • Malayalam (ready to train)
  • Bengali + Assamese (DONE, August 23, 2020)
  • Punjabi (ready to train)
  • Abkhaz (ready to train)

Group 7 (Improvement and possible extra models)

  • Japanese version 2 (DONE, March 21, 2021) + vertical text
  • Chinese version 2 (DONE, March 21, 2021) + vertical text
  • Korean version 2 (DONE, March 21, 2021)
  • Latin version 2 (DONE, March 21, 2021)
  • Math + Greek?
  • Number+symbol only

Guideline for new language request

To request support for a new language, please send a PR with the following two files:

  1. In the folder easyocr/character, we need 'yourlanguagecode_char.txt', which contains a list of all characters. Please see the format/examples in the other files in that folder.
  2. In the folder easyocr/dict, we need 'yourlanguagecode.txt', which contains a list of words in your language. On average we have ~30000 words per language, with more than 50000 words for popular ones. More is better in this file.
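A minimal sketch of how the two requested files could be produced from a raw word list. The one-entry-per-line layout is an assumption here (check the existing files in easyocr/character and easyocr/dict for the actual format), and `build_char_list` and `write_language_files` are hypothetical helper names, not part of EasyOCR.

```python
from pathlib import Path


def build_char_list(words):
    """Return the sorted list of distinct characters used in `words`."""
    chars = set()
    for word in words:
        chars.update(word.strip())
    return sorted(chars)


def write_language_files(lang_code, words, out_dir="."):
    """Write <lang_code>.txt (words) and <lang_code>_char.txt (characters).

    Assumes one entry per line; verify against the existing files in the
    repository before sending a PR.
    """
    out = Path(out_dir)
    # Deduplicate and drop empty lines so the dictionary stays clean.
    unique_words = sorted({w.strip() for w in words if w.strip()})
    (out / f"{lang_code}.txt").write_text(
        "\n".join(unique_words) + "\n", encoding="utf-8"
    )
    (out / f"{lang_code}_char.txt").write_text(
        "\n".join(build_char_list(unique_words)) + "\n", encoding="utf-8"
    )
    return len(unique_words)
```

Deriving the character list from the word list keeps the two files consistent, but punctuation and digits that never appear in the dictionary would still have to be added by hand.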

If your language has unique elements (such as 1. Arabic: characters change form when attached to each other, and text is written from right to left; 2. Thai: some characters need to be above the line and some below), please explain them to me as best you can and/or share useful links. It is important to take care of the details to achieve a system that really works.

Lastly, please understand that my priority will have to go to popular languages or sets of languages that share most of their characters (also tell me if your language shares a lot of characters with others). It takes me at least a week to work on a new model. You may have to wait a while for a new model to be released.

Why won't you share the training code, so people can train the model themselves?

For Group 4 you could add Ukrainian, Bulgarian, and maybe Mongolian; although it is not Slavic, it uses the Cyrillic script.

Do you plan to only work with human languages? It would be amazing to add a model to recognize mathematical formulas.

I guess Tamil and Telugu can be added to one group because they belong to a language family called 'Dravidian', meaning they are related to each other in terms of grammar and word arrangement. Two other languages popular in India that belong to that family can also be added to that group: Kannada and Malayalam (for further info: https://en.m.wikipedia.org/wiki/Dravidian_languages). Moreover, Telugu and Kannada share some common letters and words. I will be adding the alphabet and words of the Kannada language for a language request.
Great project, keep it up👍

For Group 4
Bulgarian
dict bg.txt
char bg_char.txt

I'd highly recommend supporting the Devanagari script (Wiki - https://en.wikipedia.org/wiki/Devanagari), which is the fourth most widely adopted writing system in the world. Please go through the Wikipedia link to understand its widespread usage across many languages, including Sanskrit, Hindi, Marathi, Awadhi and Haryanvi.

I see you have included "Hindi" as a target language, which of course, is the most spoken language in the Indian Subcontinent.

If you could let me know the current word count you have (maybe share the "dict" & "alphabets" directories), I can continue with the research and share more details about the language, as it's my first language.

Hindi has 47 primary letters (14 vowels and 33 consonants).

You can contact me @ prakash.upadhyay93@yahoo.com

Can I help with the Persian (Farsi) language? I can supply some popular words and characters.

@rkcosmos

Can I contribute in any way? I am fluent in Hindi as well as English. I may also be of help on the programming side: I know Python, C and Java, and I am good at front-end work with HTML, CSS and JavaScript (basic).

I recommend adding the Punjabi language, which is the 10th most spoken language in the world.
pb_char.txt

@edloginova After doing human language, we can explore math as well.

@upadhyayprakash The lists are here: easyocr/character and easyocr/dict

@arashjafari It looks like we already have both the words and the chars. You can recheck whether everything is all right.

@junaidgirkar sounds good, I'll keep in mind. May call you for help.

Why won't you share the training code, so people can train the model themselves?

I agree with this. If the training code and a sample dataset are provided, many people can train the model for their language. With free GPU services like Google Colab and Kaggle Kernels, anyone can train models online and contribute much faster.

@yossibiton @Vijayabhaskar96 Because the training process is still not straightforward. Even I have to think carefully when creating a model. I will share it later when it's clean. Please don't pressure me; I am doing a lot of work for free.

@rkcosmos Can we add support for Urdu? It is very similar to Persian and Arabic (though without many of the complexities of Arabic).

Why won't you share the training code, so people can train the model themselves?

@yossibiton @Vijayabhaskar96 Because the training process is still not straightforward. Even I have to think carefully when creating a model. I will share it later when it's clean. Please don't pressure me; I am doing a lot of work for free.

Sorry I made you feel this way; I didn't mean to pressure you. I just wanted to help. Take your time, you're doing great work!

@rkcosmos For Group 1, could you please add Urdu to that group? Urdu is very similar to Arabic and Persian and I've just submitted the PR for the character list and a dictionary. So it should be ready to go!

cc: @rahilwazir

This might help for Arabic:

https://github.com/OSINTAI/Arabic_Words

I added the Marathi character and dictionary data set files; please train it.
mr.txt

@sardasumit did you forget a link for mr_char.txt?

@rkcosmos It is the same as the Hindi characters.
mr_char.txt

@rkcosmos
Malayalam (https://en.wikipedia.org/wiki/Malayalam), belongs to Group 6.
#143
This PR contains character and word lists.

Hi! Thanks for your work. Some notes about Hebrew: some letters have final forms (the letter changes its shape when it is placed at the end of a word): https://en.wikipedia.org/wiki/Final_form There are also diacritical signs (https://en.wikipedia.org/wiki/Niqqud) that are used to represent vowels or to distinguish between alternative pronunciations of letters. (Arabic also has final forms, among others, and diacritical signs.) I didn't provide the diacritical signs; I assume it's better to train without them first, since everyday writing consists of plain letters without diacritical signs.

I remembered an important thing. Hebrew also has a cursive form (https://en.wikipedia.org/wiki/Cursive_Hebrew), and people sometimes mix it with regular writing, even in printed matter. It's the same letters (chars), but effectively another font (e.g. https://opensiddur.org/wp-content/uploads/fonts/display-font-charmap.php?fnt=DorianCLM-Italic ). Maybe it's also better not to implement this immediately; I don't know.

@nishad Malayalam and Tamil are both Dravidian but do not use the same script, so I have to build two models.
@imvladikon OK, I will try to keep this in mind when building the Hebrew model.

Question for Indian-language speakers: I'm looking into the Hindi char and dict files, and there are a lot of chars that appear in the word list but not in the char list. Examples are
['ा', '्', 'ि', 'ी', 'ं', 'ो', 'ु', 'ँ', 'ू', 'ड़', 'ै']. What are these symbols?
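The check described above (characters that occur in the word list but not in the char list) could be sketched as follows. `missing_chars` is a hypothetical helper, and loading the two files is left out because their exact layout is not stated in this thread.

```python
from collections import Counter


def missing_chars(words, charset):
    """Count characters used in `words` that are absent from `charset`.

    Returns (character, count) pairs, most frequent first.
    """
    counts = Counter(
        ch for word in words for ch in word if ch not in charset
    )
    return counts.most_common()


# Example with two Devanagari code points written as escapes:
# U+0915 (letter KA) is in the char list, U+093F (vowel sign I) is not.
words = ["\u0915\u093f", "\u0915"]
charset = {"\u0915"}
for ch, n in missing_chars(words, charset):
    print(f"U+{ord(ch):04X} appears {n} time(s) but is not in the char list")
```

Sorting by frequency makes it easier to see whether the missing entries are systematic (for example, all vowel signs) or just noise in the dictionary.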

@rkcosmos Those are parts of the existing alphabet; when combined with a letter, they create a new letter. I think the technical term is grapheme? I'm not sure. I would like to know whether they render fine or whether something goes wrong like it did with Tamil.

@Vijayabhaskar96 So far, Devanagari doesn't have any problems. It is well supported in Unicode.

Another addition about Hebrew ;) and it's important. Some diacritic signs matter, such as geresh and gershayim. Geresh combined with ג, ז, צ is used for sounds (j, g, ch) that are not represented in the alphabet, and the double geresh (gershayim) is used for widely used abbreviations (kitsur); the most famous is תנ"ך (Tanakh). Sometimes people type ordinary quotation marks or apostrophes instead of geresh or gershayim (e.g. תנ''ך).
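One way to handle the mixed typing conventions described above is to normalize ASCII quotes that follow Hebrew letters into the dedicated Unicode punctuation (U+05F3 geresh, U+05F4 gershayim) before building a dictionary or comparing OCR output. This is only a sketch under that assumption; `normalize_geresh` is a hypothetical helper, not something EasyOCR provides.

```python
import re

GERESH = "\u05f3"      # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05f4"   # HEBREW PUNCTUATION GERSHAYIM
_HEB = "\u05d0-\u05ea"  # range of Hebrew letters, alef to tav

# A double quote (or two apostrophes) between two Hebrew letters.
_DOUBLE = re.compile("(?<=[" + _HEB + "])(?:''|\")(?=[" + _HEB + "])")
# A single apostrophe directly after a Hebrew letter.
_SINGLE = re.compile("(?<=[" + _HEB + "])'")


def normalize_geresh(text):
    """Replace ASCII quotes used as geresh/gershayim with U+05F3/U+05F4."""
    text = _DOUBLE.sub(GERSHAYIM, text)
    return _SINGLE.sub(GERESH, text)
```

Text in other scripts is left untouched, because both patterns require a preceding Hebrew letter.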

Could you please do the same for the Odia language, which is also a classical language of India and falls under Group 6?
My mail ID: jena.omprakash@gmail.com
I can provide you with the datasets for the Odia language.

@omprakash-jena You can create a pull request to add the files, or you can attach the files in a comment here.

@Vijayabhaskar96 @sardasumit @junaidgirkar @upadhyayprakash

Question for Indian-language speakers: I'm testing the Devanagari model with this
hi1 .
The result is ['50', '40', 'बसझरकर', 'SPEED', 'मािकट', 'LIMIT', 'BASRURKAR', 'MARKET'].

The problem is with मािकट. It doesn't look like what is written in the original image. But when I run
for c in 'मािकट': print(c), I get म ा ि क ट, which looks about right. What's going on here? Is it just the way Python renders Hindi?

@rkcosmos

Both Devanagari strings are recognized differently from what is in the image.
They are बसरूरकर and मार्किट.
These characters join (र+ ्+ क+ ि ) and render as र्कि
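To see what is happening at the code-point level, the standard `unicodedata` module can list each code point with its Unicode name and category. The sketch below compares the expected word with the recognized one; the strings are written as escapes so that the code points are unambiguous, and `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"),
         unicodedata.category(ch))
        for ch in text
    ]


expected = "\u092e\u093e\u0930\u094d\u0915\u093f\u091f"  # the word in the image
recognized = "\u092e\u093e\u093f\u0915\u091f"            # the model output

for label, text in (("expected", expected), ("recognized", recognized)):
    print(label)
    for code, name, category in describe(text):
        print(f"  {code}  {category}  {name}")

# Code points present in the expected word but absent from the output.
print(sorted(f"U+{ord(ch):04X}" for ch in set(expected) - set(recognized)))
```

Categories starting with "M" are combining marks: they are stored after the consonant they attach to, even when the renderer draws them to its left or merges them into a conjunct, which is why printing character by character looks different from the rendered word.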

@nishad Wow, this is really hard. It means the OCR needs to understand how to combine characters in a very complex way.

@rkcosmos, this is complex and I am not knowledgeable enough to explain it.
@santhoshtr could you please share your expertise?

@rkcosmos I don't speak Hindi, but this is interesting. I think the problem is that hi_char.txt doesn't have all the chars. For example: कि is not there, but क and ि are present, while क+ ि = कि
If you look at the Tamil chars, all the letters with those extra parts are present, instead of just the root letter and the extra parts separately. I.e. க + ா = கா; both க and கா are present in ta_char.txt, but not ா
I think this is the right way to add characters to the lang_char.txt file.
There are many combinations here, and I think all of them, plus many others, should be added as unique characters, but just 85 chars exist in hi_char.txt
I think this depends on what people call a character, but if कि exists in hi_char.txt, it will be easy for the network to simply use it instead of figuring out the order of all the parts that constitute the letter, right?

I might be wrong about the Hindi alphabet as I don't speak the language; do correct me if I'm wrong @nishad

@Vijayabhaskar96 It depends on how many combinations there are for each language. It's possible to do it both ways. For example, I did Thai before with separated characters. We have something like ท + ี + ่ = ที่. The first three characters are in the list but the combined form is not. For Thai, I use separated characters because the number of all possible combinations is just too large to imagine. For Tamil, you have 325 combined forms, which I think should be doable. Now for Hindi, there is र्कि, which is a combined form of 4 separate characters! I would guess the number of all possible combined forms is extremely large, so we might have to go the separated-char way. I just hope that my current neural network settings can learn such complexity. I will let everyone know in a few days whether it works or not.

According to the link in my previous comment, there are about 500 combinations in Hindi. Even with all the combinations that aren't in that link, I don't think it will exceed 2000 characters, and the Telugu list has 2000+ chars. So try what works best; findings from this experiment may simplify the training approach for other languages.
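The trade-off between separated characters and combined forms can be estimated from a word list by counting how many distinct combined units actually occur. The sketch below uses a deliberately simplified clustering rule (a base character, any following combining marks, and any letter joined through the Devanagari virama U+094D); it is an approximation, not the full Unicode grapheme-cluster algorithm, and `split_clusters` and `count_clusters` are hypothetical helper names.

```python
import unicodedata
from collections import Counter

VIRAMA = "\u094d"  # Devanagari sign virama, joins consonants into conjuncts


def split_clusters(text):
    """Split `text` into approximate combined units (simplified rule)."""
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")
        # A letter directly after a virama continues the same conjunct.
        joins_previous = (
            bool(clusters)
            and clusters[-1].endswith(VIRAMA)
            and category.startswith("L")
        )
        if clusters and (is_mark or joins_previous):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def count_clusters(words):
    """Count how often each combined unit occurs across `words`."""
    return Counter(c for word in words for c in split_clusters(word))


word = "\u092e\u093e\u0930\u094d\u0915\u093f\u091f"
print(len(word), "code points ->", len(split_clusters(word)), "combined units")
```

Running `len(count_clusters(words))` over a full dictionary gives a rough size for a combined-form character list, which can then be compared with the size of the separated list.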

Is Belarusian ready? Judging by the commit, the list probably needs an update.

example3

Version 1.1.5 now supports the Devanagari script. Please test, give feedback and spread the word to your community.

Hello,
Can I help with the Persian (Farsi) language?
I can supply some popular words and characters, and I also work in the field of Persian typeface design and libre Persian fonts.

Cyrillic script (Russian, Serbian (Cyrillic), Belarusian, Bulgarian, Mongolian, Ukrainian) is available for testing in the latest update (not on pip yet). Please test and give feedback.

Belarusian, Russian, Ukrainian:
https://colab.research.google.com/drive/1Sy1endzzbommR2b5pyw7YIoa8CIV-jkN?usp=sharing
The quality is not bad, but not the best: in the Colab, the wiki test for Ukrainian and Belarusian is far from the best, and the Belarusian model sometimes could not split words correctly. In Ukrainian there is an additional sign, the apostrophe ’, which marks softness or, on the contrary, emphasizes a hard consonant (as in ім'я), depending on phonetic rules and cases.

@imvladikon Thanks for the analysis, it's very useful. I'll keep the special character issue in mind for the next fine-tuning. For low-resolution images, you might need to change some parameters to get better results. For example, I would try mag_ratio = 1.2 (that's zooming by 20%). One thing I don't understand in your Colab is the expected result for ulica_be.jpg. It's the phrase 'лінгвістичний'. Is it a typo, or does your language have a special rule for combining characters?
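A small sketch of how one might try several `mag_ratio` values on a low-resolution image. `readtext` and its `mag_ratio` argument are EasyOCR's own; choosing the run with the highest mean confidence is only a heuristic assumed for this illustration, and `best_reading` is a hypothetical helper.

```python
def best_reading(reader, image, mag_ratios=(1.0, 1.2, 1.5)):
    """Run OCR at several magnifications and keep the most confident run.

    `reader` is expected to behave like easyocr.Reader: readtext() returns
    a list of (bounding box, text, confidence) tuples.
    Returns (mean confidence, mag_ratio, results), or None if nothing is read.
    """
    best = None
    for ratio in mag_ratios:
        results = reader.readtext(image, mag_ratio=ratio)
        if not results:
            continue
        score = sum(conf for _, _, conf in results) / len(results)
        if best is None or score > best[0]:
            best = (score, ratio, results)
    return best


# Usage (requires easyocr to be installed and downloads a model):
#   import easyocr
#   reader = easyocr.Reader(['be'])
#   print(best_reading(reader, 'ulica_be.jpg'))
```

Larger magnification costs more processing time, so it is worth keeping the list of ratios short.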

@rkcosmos Yeah, it's a typo :) For Belarusian it should be "лінгвістычны"; I fixed it. The word was accidentally written in Ukrainian, "лінгвістичний" ("linguistic").

Arabic is now supported in v1.1.6. To use the latest version, you need to uninstall first
pip uninstall easyocr

and install directly from source code
pip install git+git://github.com/jaidedai/easyocr.git

Please try it and give feedback.

You can do that in a single command:
pip install git+git://github.com/jaidedai/easyocr.git --upgrade
No need to uninstall manually.

@rkcosmos Thanks for adding support for Tamil. I tested a few images and it worked well; some images required fiddling with the args for better results, but anyway, great job overcoming the Unicode issues. May I know what you did? And if you used pyvips as I suggested, how did it go? Can you explain briefly?

@rkcosmos You marked Urdu as done (great news!). Can I try it in the latest?

@fnasim @rkcosmos I've been testing Urdu with multiple variations. It's not completely there yet, but it's close. There are still some subtle differences.

The بھائی is recognized as میں ,بجال is recognized as لکھیں ,ئیں recognized as اکھیں etc.
I will continue to test more and share the results.

@rahilwazir Thank you for testing! Please keep a list of discrepancies and always keep track of the image you used (minimal is better). I tried a screen grab from bbc.com/urdu and it worked pretty well (100% accuracy based on quick visual inspection - GREAT start, kudos @rkcosmos ).

@rahilwazir here's the test I did - https://codepen.io/fnasim/pen/OJNNmPg. I've embedded the image and output json from easyocr on there so just hit 'Process' to see rectangles. Please feel free to click them.

@rkcosmos the OCR is pretty accurate but the confidence score is extremely low. Do you know why?

If the math model is updated, what is the output of the input image?

Screenshot, 2020-09-02 10-17-02 ==> ?

Can I run in Android via TF-lite?

@rkcosmos Thanks for this awesome work.
I see that in Group 5, Kannada is listed as "NEED HELP"
NOTE: I'd already sent PR with Kannada vocabulary #124

We would appreciate it if you could move Kannada up in your priorities.
Is there anything I can do to help speed up training the first version of the OCR model for Kannada? Thanks.

@rkcosmos Any update on this work? It's been a while since I have added the Abkhazian language!

I noticed there's no date for Group 7. I'm just wondering when "Chinese version2 + vertical text" will be released. The current Chinese model is excellent in many ways, but doesn't work for vertical text. Thanks for the continued improvements!

Hi everyone, sorry for the delay. We have been testing a smaller network for faster processing time. Apart from this, we also need to do things that put food on the table. I can't promise any timeline, but we are working on it.

By the way, @thammegowda, Kannada and Telugu characters may look similar, but they use different Unicode character sets. I may have to separate them into two models.

What a great to-do list! Any plans or timelines for supporting the Myanmar/Burmese language?
It is similar to Thai, so I think it belongs to Group 6 (Languages that don't share characters with others).

Any plans for French? How can I help?

Hi, how can I use the Persian language?
reader = easyocr.Reader(['ch_sim','fa'])
result = reader.readtext('chinese.jpg')
Why doesn't this work?

Even if it is messy, I think a lot of us would be really happy if you shared your training code as well. Would it be possible in any way? Thanks in advance.

For Group 4
Tajik
tajik.zip

Hello there! Can you give me some feedback, so I know that I sent you the correct data for training the Tajik language? Thank you!

Excuse me, will there be a chi_tra version 2?

Are the model training scripts not available? If somebody wants to train on a new language, how can they contribute to improving the model?

Hi @rkcosmos, really impressive work with the OCR framework.
I couldn't find a code for Greek language here, but I see (Greek + math) in the development list above. Do you know if the Greek language itself will be supported anytime soon?

Hi @rkcosmos , really impressive work.
Will you be adding the Khmer language?

Hi @rkcosmos,

We created a dictionary and dict_char for the Greek language; you can find them here

Can you add Dzongkha? Dzongkha is the national language of Bhutan and is similar to Tibetan. Like Thai, it is written continuously from left to right and does not have whitespace between words. The following paper discusses next syllabus prediction for Dzongkha: https://doi.org/10.1016/j.jksuci.2021.01.001

Hi @rkcosmos,
Is there any plan to add English (vertical text) and digits (vertical text)?

Thanks in advance.

Hi @rkcosmos , Can you share the Japanese dataset you used to train? Thanks a lot!

Hi @rkcosmos. Thank you very much for the effort you are putting in. Is there a plan to include the Indian languages Gujarati and Oriya?

Hi @rkcosmos, do you know if Hebrew will be ready soon? Thanks a lot!

Hello @rkcosmos, thank you for your great work.
I ran version 1.4.1 for Persian (Farsi), but the problem I am facing is that Persian is written and read from right to left, while the model detects words from left to right. This confuses text detection and reduces the accuracy.
Is there any way to fix this issue?
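If the issue is only the order in which detected boxes are returned, one possible post-processing workaround (not a fix from the maintainers) is to regroup the results into lines and sort each line from right to left. The sketch assumes results in the shape `readtext` returns, i.e. (bounding box as four [x, y] points, text, confidence); `order_rtl` and the `line_tolerance` value are hypothetical.

```python
def order_rtl(results, line_tolerance=10):
    """Reorder OCR results top-to-bottom, then right-to-left in each line.

    Boxes whose vertical centers lie within `line_tolerance` pixels of the
    first box in a line are treated as belonging to that line.
    """
    def y_center(bbox):
        return sum(point[1] for point in bbox) / len(bbox)

    def x_right(bbox):
        return max(point[0] for point in bbox)

    lines = []
    for item in sorted(results, key=lambda r: y_center(r[0])):
        if lines and abs(
            y_center(item[0]) - y_center(lines[-1][0][0])
        ) <= line_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])

    ordered = []
    for line in lines:
        # Rightmost box first, as the text is read right to left.
        ordered.extend(sorted(line, key=lambda r: x_right(r[0]), reverse=True))
    return ordered
```

This only changes the reading order of the output; it does not affect how characters are recognized inside each box.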

Hi @rkcosmos, thanks so much for all the work you've put in. I've included a PR for the Amharic language, which is spoken by over 60 million people.
#616

One potential issue is that Amharic words contain a number of prefixes and suffixes to indicate the object, number of items, tense, gender, negation and so on. Thus, a single verb may morph in a number of ways that are not all included in the dictionary.

Hi @rkcosmos, I also submitted a PR for the Tigrinya language, which is similar to Amharic and spoken by over 10 million people.
#615

It has the same mutation issue as Amharic. Also, Arabic numerals are very common, even though the language has its own numeral system.

@rkcosmos Question: Why is the Chinese dict in pinyin rather than Chinese? In the dict folder, I cannot find a Chinese dict that is not pinyin. How is this mapping achieved? If I want to add some words to the Chinese dict, how do I add the training data and the dict entries?

@rkcosmos Has the Greek language been updated? I saw someone contributing Greek files in the comments.

Does EasyOCR support the Sinhala language?