Please add the API call to translate the language code to the full language name

Question

yurivict opened this issue 3 months ago

Your Feature Request

Functions like GetAvailableLanguagesAsVector return language codes.
There's a page listing all languages with their full names, but these full names don't seem to be available through the API.

Could you please add an API call that returns the full language name for a given language code?

Thank you,
Yuri

Stefan Weil · Answer 1 · Thu Mar 07 2024 16:19:59 GMT+0800 (China Standard Time)

Most language codes are ISO 639-2 codes. Use ICU4C to translate such codes into names.

Code for language_code_to_name.cpp:

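#include <iostream>
#include <string>
#include <unicode/locid.h>
#include <unicode/unistr.h>

// Return the display name for a language code (for example "deu"),
// localized for the default locale and encoded as UTF-8.
std::string getLanguageFullName(const std::string& languageCode) {
    icu::Locale locale(languageCode.c_str());
    icu::UnicodeString name;
    locale.getDisplayName(name);  // fills `name` with the display name
    std::string s;
    name.toUTF8String(s);
    return s;
}

int main(int argc, char *argv[]) {
    // Require exactly one argument instead of dereferencing a missing argv[1].
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " LANGUAGE_CODE" << std::endl;
        return 1;
    }
    std::string languageCode = argv[1];
    std::string languageName = getLanguageFullName(languageCode);
    std::cout << languageName << std::endl;
    return 0;
}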

Compile it with:

g++ -o language_code_to_name language_code_to_name.cpp -licui18n -licuuc -licudata

Then run it with all traineddata files:

for l in $(ls *.traineddata|sed s/.traineddata//); do echo $l - $(LANG=C.UTF-8 ./language_code_to_name $l); done
afr - Afrikaans
amh - Amharic
ara - Arabic
asm - Assamese
aze_cyrl - Azerbaijani (Cyrillic)
aze - Azerbaijani
bel - Belarusian
[...]
tgk - Tajik
tha - Thai
tir - Tigrinya
ton - Tongan
tur - Turkish
uig - Uyghur
ukr - Ukrainian
urd - Urdu
uzb_cyrl - Uzbek (Cyrillic)
uzb - Uzbek
vie - Vietnamese
yid - Yiddish
yor - Yoruba

The same program can also show the full language names in French, German, Italian, Spanish or other languages.
It fails to show a full language name only for equ, frk and osd, because those codes are not ISO codes.

Therefore I don't think that Tesseract should add that API call.

yuri@FreeBSD · Answer 2 · Thu Mar 07 2024 16:47:42 GMT+0800 (China Standard Time)

Thank you for the comprehensive answer and the demo program.
I agree with you that Tesseract doesn't need that API call.

Stefan Weil · Answer 3 · Thu Mar 07 2024 17:31:25 GMT+0800 (China Standard Time)

Regarding frk.traineddata, it looks like the ISO code should be deu_latf. Then the full language name German (Fraktur Latin) can be derived automatically.
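
A caller who wants names for all traineddata files could handle the three non-ISO codes before asking ICU. This is only a sketch: both helper names are hypothetical, the frk mapping follows the suggestion above, and the labels for equ and osd are assumed from Tesseract's language list rather than taken from any API.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: map a legacy Tesseract code to a code that ICU
// can parse. Codes without an alias are returned unchanged.
std::string normalizeLanguageCode(const std::string& code) {
    static const std::unordered_map<std::string, std::string> aliases = {
        {"frk", "deu_latf"},  // as suggested above for frk.traineddata
    };
    auto it = aliases.find(code);
    return it == aliases.end() ? code : it->second;
}

// Hypothetical helper: fixed labels for codes that are not languages.
// Returns an empty string for every other code.
std::string specialCodeName(const std::string& code) {
    static const std::unordered_map<std::string, std::string> names = {
        // Assumed labels, not ISO names.
        {"equ", "Math / equation detection module"},
        {"osd", "Orientation and script detection module"},
    };
    auto it = names.find(code);
    return it == names.end() ? std::string() : it->second;
}
```

The idea is to check specialCodeName first and, if it returns an empty string, pass normalizeLanguageCode(code) to the ICU lookup.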