facebookresearch / LASER

Language-Agnostic SEntence Representations

Reorganize the lists of languages

avidale opened this issue

The problem

Currently, the file https://github.com/facebookresearch/LASER/blob/MLH-dev/laser_encoders/language_list.py is messy in several ways:

  1. The dictionary LASER2_LANGUAGE includes 204 unique languoids (that is, language+script combinations), although we know that LASER-2 supports the same set of ~93 languages as LASER-1: https://github.com/facebookresearch/LASER#supported-languages.
  2. The dicts LASER2_LANGUAGE and LASER3_LANGUAGE are redundant: both contain the various renames of the same languoids, so whenever we update the names of a language, we have to update both dicts by hand (see the illustrative sketch after this list).
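
The entries below are purely illustrative (not the exact contents of language_list.py), but they show the shape of the duplication: the same alias-to-code mappings appear in both dicts, so a single rename has to be made twice.

```python
# Hypothetical entries for illustration only; the real file lists many more aliases.
LASER2_LANGUAGE = {
    "acehnese": "ace_Latn",
    "ace": "ace_Latn",
    "ace_Latn": "ace_Latn",
    # ... ~200 more languoids, far more than LASER-2 actually supports
}
LASER3_LANGUAGE = {
    "acehnese": "ace_Latn",  # the same aliases repeated verbatim
    "ace": "ace_Latn",
    "ace_Latn": "ace_Latn",
    # ...
}
```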

Proposed solution

  1. Reduce the LASER2 language list to only the languages actually supported by LASER-2 (in case of uncertainty about a specific language, contact us in the chat or in the GitHub comments).
  2. Separate the list of names for each language from the lists of LASER-2 and LASER-3 languages, like below:
```python
# declare the languages supported by each model, regardless of their renames
LASER2_LANGUAGES_LIST = ["", ...]  # there should be about 93 languages
LASER3_LANGUAGES_LIST = ["ace_Latn", "aka_Latn", ...]  # there should be 147 languages

# declare the various names of each language, regardless of the model
LANGUAGE_NAMES = {
    "ace_Latn": ["acehnese", "ace", "ace_Latn"],
    "aka_Latn": ["akan", "aka", "aka_Latn"],
    ...
}

# now combine these declarations automatically:
LASER2_LANGUAGE = build_language_names_dict(LASER2_LANGUAGES_LIST, LANGUAGE_NAMES)
LASER3_LANGUAGE = build_language_names_dict(LASER3_LANGUAGES_LIST, LANGUAGE_NAMES)
# as a result, we should get dicts in our original format:
# {languoid name -> languoid code or list of codes}
```
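
The helper build_language_names_dict is only named above, not defined. A minimal sketch of what it could look like, assuming LANGUAGE_NAMES maps each canonical code to its list of aliases and that an alias shared by several supported languoids should map to a list of codes (to match the original dict format):

```python
def build_language_names_dict(supported_codes, language_names):
    """Build {alias -> code or list of codes} for the supported languoids only."""
    result = {}
    for code in supported_codes:
        # fall back to the bare code if no aliases are declared for it
        for alias in language_names.get(code, [code]):
            if alias not in result:
                result[alias] = code
            elif isinstance(result[alias], list):
                result[alias].append(code)
            else:
                # the alias is shared by several languoids (e.g. two scripts): collect them
                result[alias] = [result[alias], code]
    return result
```

With this, adding or renaming an alias only touches LANGUAGE_NAMES, and the per-model dicts are derived automatically from the two supported-language lists.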