marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository

Home Page: https://marian-nmt.github.io


Replacing unknown words with most attentive source token

alvations opened this issue · comments

In OpenNMT, it's possible to replace <unk> with the most attentive source word: https://opennmt.net/OpenNMT/translation/unknowns/

Often times <unk> symbols will correspond to proper names that can be directly transposed between languages. The -replace_unk option will substitute <unk> with source words that have the highest attention weight.

OpenNMT checks for <unk> in the output and then looks up the attention weights: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/translate/translator.py#L169
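To make the idea concrete, here is a minimal sketch of that replacement step in plain Python. It is not OpenNMT's or Marian's code: the function name `replace_unk`, the `<unk>` string, and the layout of `attention` (one row per target token, one column per source token) are assumptions made for illustration.

```python
from typing import List, Sequence

UNK = "<unk>"  # assumed spelling of the unknown-word symbol


def replace_unk(
    source: Sequence[str],
    target: Sequence[str],
    attention: Sequence[Sequence[float]],
    unk: str = UNK,
) -> List[str]:
    """Replace each unk in `target` with the most attended source token.

    attention[t][s] is assumed to be the weight that target position t
    puts on source position s.
    """
    if len(attention) != len(target):
        raise ValueError("attention needs one row per target token")

    output = []
    for t, token in enumerate(target):
        if token == unk:
            weights = attention[t]
            # Index of the source token with the highest attention weight.
            best = max(range(len(source)), key=lambda s: weights[s])
            output.append(source[best])
        else:
            output.append(token)
    return output


source = ["Rāja", "means", "king"]
target = ["<unk>", "bedeutet", "König"]
attention = [
    [0.80, 0.10, 0.10],
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
]
replaced = replace_unk(source, target, attention)
```

In this toy example the first target token is an unknown whose attention row peaks on the first source token, so it is replaced by "Rāja"; the other tokens pass through unchanged.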

Currently, it's much cleaner to do https://github.com/marian-nmt/marian-dev/blob/master/src/translator/beam_search.cpp#L301 with --allow-unk, but checking the hypotheses' alignments might be a good way to replicate OpenNMT's -replace_unk behavior.

(Not sure how to aggregate the attention heads too. Max? Max-Pool? Average?)
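The choice of aggregation can change which source token wins. The sketch below compares averaging with an element-wise max over heads on a made-up example; the names `aggregate_heads`, `mean`, and `argmax` and the `heads[h][t][s]` layout are hypothetical, and nothing here reflects what Marian or OpenNMT actually do.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]  # [target position][source position]


def aggregate_heads(
    heads: Sequence[Matrix],
    reduce: Callable[[Sequence[float]], float],
) -> Matrix:
    """Combine per-head attention matrices into one, cell by cell."""
    n_tgt = len(heads[0])
    n_src = len(heads[0][0])
    return [
        [reduce([head[t][s] for head in heads]) for s in range(n_src)]
        for t in range(n_tgt)
    ]


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def argmax(row: Sequence[float]) -> int:
    return max(range(len(row)), key=row.__getitem__)


# Three heads, one target token, three source tokens (invented numbers).
heads = [
    [[0.90, 0.10, 0.00]],  # one sharply peaked head
    [[0.00, 0.55, 0.45]],  # two diffuse heads that agree with each other
    [[0.00, 0.55, 0.45]],
]

averaged = aggregate_heads(heads, mean)
maxed = aggregate_heads(heads, max)

pick_by_mean = argmax(averaged[0])
pick_by_max = argmax(maxed[0])
```

Here averaging follows the two heads that agree (source position 1), while max follows the single confident head (source position 0). The max-aggregated rows no longer sum to 1, which does not matter if only the argmax is used.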

(Not sure how this would work with word pieces, though. Maybe trust the model to learn that a source token split into pieces retains the order of the attention?)

Hi,
Have you tried that with a transformer model?
We can train with guided attention, but then again if you just do that then you can implement this externally. Also if you have UNKs with wordpieces, then you are doing wordpieces wrong :)

In OpenNMT, I've only tried that with a modified LSTM model. Not sure how they do it for the transformer *shrugs*...

Yeah, I feel like this is a lot of effort with little return.

One benefit is handling mixed inputs where the vocab doesn't have the chars/sub-words from another language.

E.g. EN->JA with some mixed Hindi/Sanskrit.

Source: Rāja (Sanskrit: राज) means "chief, best of its kind" or "king".
Target: Rāja(サンスクリット語:<unk>)は、「最高、その種の最高」または「王」を意味します。

Would be nice if it's Rāja(サンスクリット語:राज)は、「最高、その種の最高」または「王」を意味します。

ASCII escaping is probably more reliable:

Convert राज to [[e0a4b0e0a4bee0a49c]], apply SentencePiece, translate (and hope the escape survives), then convert back.
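A rough sketch of that round trip, covering only the escaping and unescaping around the translation call. The helper names `escape_unknown` and `unescape` are hypothetical, and `known_chars` stands in for whatever character set the real vocabulary covers; how the tokenizer and model treat the `[[...]]` marker is not modeled here.

```python
import re
from typing import List, Set

# Matches [[...]] containing an even number of lowercase hex digits.
ESCAPED = re.compile(r"\[\[((?:[0-9a-f]{2})+)\]\]")


def escape_unknown(text: str, known_chars: Set[str]) -> str:
    """Wrap each run of characters outside known_chars as [[utf8-hex]]."""
    out: List[str] = []
    run: List[str] = []

    def flush() -> None:
        if run:
            out.append("[[" + "".join(run).encode("utf-8").hex() + "]]")
            run.clear()

    for ch in text:
        if ch in known_chars or ch.isspace():
            flush()
            out.append(ch)
        else:
            run.append(ch)
    flush()
    return "".join(out)


def unescape(text: str) -> str:
    """Turn [[utf8-hex]] markers back into the original characters."""

    def decode(match: "re.Match[str]") -> str:
        try:
            return bytes.fromhex(match.group(1)).decode("utf-8")
        except UnicodeDecodeError:
            # Leave a damaged marker as-is rather than guessing.
            return match.group(0)

    return ESCAPED.sub(decode, text)


known = set("()Sanskrit:")  # toy stand-in for the vocabulary's characters
escaped = escape_unknown("(Sanskrit: राज)", known)
restored = unescape(escaped)
```

The escape is only useful if the marker comes out of translation byte-for-byte intact, which is the "hope it survives" part; a marker that was split or altered is left untouched by `unescape` instead of being decoded into garbage.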

@alvations can we close this? Basically the answer is that we are not going to do it unless someone else does.

Sure, I think we can close this until someone revives it.

In case someone is interested, I made a Python package that does this. https://github.com/ZJaume/escape-unk