Reorganize the lists of languages
avidale opened this issue
The problem
Currently, the file https://github.com/facebookresearch/LASER/blob/MLH-dev/laser_encoders/language_list.py is messy in several ways:
- The dictionary LASER2_LANGUAGE includes 204 unique languoids (that is, language+script combinations), although we know that LASER-2 supports the same set of ~93 languages as LASER1: https://github.com/facebookresearch/LASER#supported-languages.
- The dicts LASER2_LANGUAGE and LASER3_LANGUAGE are redundant: both of them contain various renames of the same languoids, so if we ever update the names of some language, we have to manually update both of the lists.
Proposed solution
- Reduce the LASER2 language list to only the languages actually supported by LASER-2 (if you are unsure about a specific language, contact us in the chat or in the GitHub comments)
- Separate the list of names for each language from the lists of LASER-2 and LASER-3 languages, like below:
# declare the languages supported by models, regardless of their renames
LASER2_LANGUAGES_LIST = ["", ...] # there should be about 93 languages
LASER3_LANGUAGES_LIST = ["ace_Latn", "aka_Latn", ...] # there should be 147 languages
# declare various language names, regardless of the model
LANGUAGE_NAMES = {
"ace_Latn": ["acehnese", "ace", "ace_Latn"],
"aka_Latn": ["akan", "aka", "aka_Latn"],
...
}
# now combine these declarations automatically:
LASER2_LANGUAGE = build_language_names_dict(LASER2_LANGUAGES_LIST, LANGUAGE_NAMES)
LASER3_LANGUAGE = build_language_names_dict(LASER3_LANGUAGES_LIST, LANGUAGE_NAMES)
# as a result, we should get dicts in our original format, {languoid name -> languoid code or list of codes}
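
The helper build_language_names_dict is named above but not defined, so here is one possible sketch of it, not the final implementation. It assumes that a name shared by several codes should map to a list of codes, that a name with a single code should map to a plain string, and that a code with no entry in LANGUAGE_NAMES falls back to being its own only name. The data at the bottom is toy data for illustration only; the ace_Arab entry is there just to show a name shared by two codes.

```python
from collections import defaultdict


def build_language_names_dict(language_list, language_names):
    """Map every known name of each supported languoid to its code(s).

    Hypothetical implementation: returns {name: code} when a name is
    unambiguous and {name: [code, ...]} when several codes share it.
    """
    name_to_codes = defaultdict(list)
    for code in language_list:
        # Assumption: a code without declared names is its own only name.
        for name in language_names.get(code, [code]):
            if code not in name_to_codes[name]:
                name_to_codes[name].append(code)
    # Collapse one-element lists to a plain string; keep lists otherwise.
    return {
        name: codes[0] if len(codes) == 1 else codes
        for name, codes in name_to_codes.items()
    }


# Toy data, for illustration only (not the real language lists).
LANGUAGE_NAMES = {
    "ace_Latn": ["acehnese", "ace", "ace_Latn"],
    "ace_Arab": ["acehnese", "ace", "ace_Arab"],
    "aka_Latn": ["akan", "aka", "aka_Latn"],
}
LASER3_LANGUAGES_LIST = ["ace_Latn", "ace_Arab", "aka_Latn"]

LASER3_LANGUAGE = build_language_names_dict(LASER3_LANGUAGES_LIST, LANGUAGE_NAMES)
```

With this shape, renaming a language means touching only LANGUAGE_NAMES, and both model-specific dicts pick up the change the next time they are built.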