UKPLab / EasyNMT

Easy to use, state-of-the-art Neural Machine Translation for 100+ languages

Some translations are not possible

leandroalbero opened this issue

Issue description

Running the latest image, easynmt/api:2.0-cpu, with the model set to m2m_100_418M and English as the target language fails for some translations. Here are some examples:

  • 'imagina a mi'
  • 'imagina un sol'
  • 'imagina a un vikingo'

In this case, for example, setting source_lang to 'es' fixed the issue, so the problem may lie in the language detection step, or there may be no translation direction from the detected language to English.
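The workaround can be sketched as a request that pins the source language instead of relying on auto-detection. This is only an illustration: the base URL (host and port) is an assumption about the container mapping, `build_translate_url` is a hypothetical helper, and the `/translate` endpoint with `text` and `target_lang` query parameters is assumed to match the Docker API in use.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed base URL: adjust host and port to match your container mapping.
BASE_URL = "http://localhost:24080"


def build_translate_url(text: str, target_lang: str, source_lang: Optional[str] = None) -> str:
    """Build a GET URL for the translate endpoint (hypothetical helper)."""
    params = {"target_lang": target_lang, "text": text}
    # Only send source_lang when it is known; otherwise the server auto-detects.
    if source_lang is not None:
        params["source_lang"] = source_lang
    return f"{BASE_URL}/translate?{urlencode(params)}"


# Pin the source language for one of the failing examples.
url = build_translate_url("imagina un sol", target_lang="en", source_lang="es")
print(url)
```

Passing the resulting URL to any HTTP client skips the detection step entirely, which is why the failing examples start working once the language is given.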

Docker logs output:

[2023-09-28 08:38:08 +0000] [60] [INFO] Waiting for application startup.
[2023-09-28 08:38:08 +0000] [60] [INFO] Application startup complete.
Exception: 'jbo'

The text of the exception varies with every prompt; I guess it is the code of the detected language ('jbo', for instance, is the ISO 639 code for Lojban).

Updating the model that fastText uses for language identification helps solve the issue, at least for the translations that failed in my tests:
https://fasttext.cc/docs/en/language-identification.html
This repo uses lid.176.ftz, the compressed model; switching to lid.176.bin helps because it is slightly more accurate.
The lines to change are here:

EasyNMT/easynmt/EasyNMT.py

Lines 415 to 430 in 7c11ae8

def language_detection_fasttext(self, text: str) -> str:
    """
    Given a text, detects the language code and returns the ISO language code. It supports 176 languages. Uses
    the fasttext model for language detection:
    https://fasttext.cc/blog/2017/10/02/blog-post.html
    https://fasttext.cc/docs/en/language-identification.html
    """
    if self._fasttext_lang_id is None:
        import fasttext
        fasttext.FastText.eprint = lambda x: None   # Silence useless warning: https://github.com/facebookresearch/fastText/issues/1067
        model_path = os.path.join(self._cache_folder, 'lid.176.ftz')
        if not os.path.exists(model_path):
            http_get('https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz', model_path)
        self._fasttext_lang_id = fasttext.load_model(model_path)
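The proposed change amounts to swapping the model file name in both the cache path and the download URL. Below is a minimal standalone sketch of that idea, not a patch to EasyNMT itself: `lid_model_location` and `load_lid_model` are hypothetical helpers, and `urlretrieve` from the standard library stands in for the repo's own `http_get`.

```python
import os

# Both model files are published under the same base URL; only the name differs.
FASTTEXT_BASE_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/"


def lid_model_location(cache_folder: str, model_name: str = "lid.176.bin"):
    """Return (local_path, download_url) for a fastText language-ID model."""
    if model_name not in ("lid.176.bin", "lid.176.ftz"):
        raise ValueError(f"unknown language-ID model: {model_name}")
    return os.path.join(cache_folder, model_name), FASTTEXT_BASE_URL + model_name


def load_lid_model(cache_folder: str, model_name: str = "lid.176.bin"):
    """Download the model if it is not cached yet, then load it with fastText."""
    import fasttext  # imported lazily, as in the original method
    from urllib.request import urlretrieve  # stand-in for the repo's http_get

    model_path, url = lid_model_location(cache_folder, model_name)
    if not os.path.exists(model_path):
        os.makedirs(cache_folder, exist_ok=True)
        urlretrieve(url, model_path)
    return fasttext.load_model(model_path)
```

Keeping the file name as a parameter makes it easy to fall back to the compressed `lid.176.ftz` where the larger download is not acceptable.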

Some translations still fail, though. Enabling a fallback to a slower model in those cases might help.
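One way such a fallback could look, as a rough sketch rather than anything EasyNMT implements: walk the detector's top-k guesses, keep the first one the translation model supports, and only then call a slower detector. All names here (`pick_source_lang`, `predict_top_k`, `fallback_detect`) are hypothetical, and the probabilities in the stub are made up purely for illustration.

```python
from typing import Callable, Iterable, List, Optional, Tuple

LABEL_PREFIX = "__label__"  # prefix fastText puts on predicted labels


def pick_source_lang(
    text: str,
    predict_top_k: Callable[[str], List[Tuple[str, float]]],
    supported_langs: Iterable[str],
    fallback_detect: Optional[Callable[[str], str]] = None,
    min_prob: float = 0.0,
) -> str:
    """Return the best-ranked detected language that the translator supports."""
    supported = set(supported_langs)
    # First pass: the fast detector's candidates, best guess first.
    for label, prob in predict_top_k(text):
        lang = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        if lang in supported and prob >= min_prob:
            return lang
    # Second pass: an optional slower detector (hypothetical callable).
    if fallback_detect is not None:
        lang = fallback_detect(text)
        if lang in supported:
            return lang
    raise ValueError(f"no supported source language detected for {text!r}")


# Stub standing in for the fast detector; the probabilities are invented.
def fake_predict(text: str) -> List[Tuple[str, float]]:
    return [("__label__jbo", 0.41), ("__label__es", 0.38)]


print(pick_source_lang("imagina un sol", fake_predict, {"es", "en", "de"}))
```

With a real fastText model, `predict_top_k` could wrap `model.predict(text, k=5)`, which returns the labels and their probabilities as two parallel sequences. The set of supported languages would have to come from the translation model in use.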