prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit

Using masked inputs at inference time

jaspock opened this issue · comments

I am considering using YANMTT to train my own BART model. However, instead of using it as the initial model for a subsequent fine-tuning process, I am interested in using the BART model itself to generate alternative versions of an input sentence. To do this, I would like to mask a percentage of the words in the sentence at inference time and let the model generate a variation of it via beam search decoding:

  • Original sentence: Mike goes to the bookstore on Thursday
  • Possible masked input sentence: <mask> goes to the bookstore <mask>
  • Possible model output: Jerry happily goes to the bookstore with his friends

Can this be easily done with YANMTT? I am trying to build my own model for generating the synthetic samples discussed in the paper "Detecting Hallucinated Content in Conditional Neural Sequence Generation" (Section 3.1).

Hi,

What you need can be done with YANMTT.

For YANMTT the following decoding options should help:

--use_official_pretrained --model_path facebook/bart-large --slang en --tlang en --test_src

You may also want to play with options like --encoder_no_repeat_ngram_size N (N = 1, 2, 3, 4).

Honestly, you don't need YANMTT for your particular use case, since you can directly use the mask-filling example from the HF docs: https://huggingface.co/docs/transformers/model_doc/bart#mask-filling
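
For reference, here is a minimal sketch along the lines of that mask-filling example, applied to the masked sentence from this issue. It assumes the Hugging Face transformers library and facebook/bart-large; the beam-search settings (num_beams, num_return_sequences, max_length) are illustrative, not recommendations.

from transformers import BartForConditionalGeneration, BartTokenizer

# Load the off-the-shelf English BART model (the same one as in the options above).
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Masked version of the example sentence from this issue.
masked = "<mask> goes to the bookstore <mask>"
inputs = tokenizer(masked, return_tensors="pt")

# Beam search decoding; each returned hypothesis is one variation of the input.
generated = model.generate(
    inputs["input_ids"],
    num_beams=5,
    num_return_sequences=3,
    max_length=40,
)
for hypothesis in tokenizer.batch_decode(generated, skip_special_tokens=True):
    print(hypothesis)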

Thank you, @prajdabre. It is much appreciated that you are helping us with all our issues! 👍

In my case, I am interested in having BART-like models for languages other than English that are not available in mBART. This is why I ended up finding your toolkit.

So, explicitly adding the string "<mask>" in the input sentence will make the system consider it as a mask to be predicted? Is it possible to make the system automatically mask a percentage of the input tokens or replace a percentage of tokens with some other token? Basically, I want to alter the sentence in the same way it is done during BART training and then decode with beam search to have variations of the input sentence.

Finally, it is not completely clear to me how to train a system for my language of interest and then use it to denoise sentences. I know you mention reading the documentation, but is there an easier-to-follow tutorial that can teach me to do that?

Hi,

Sorry for the late reply! Weekends are rest days for me :)

So, explicitly adding the string "<mask>" in the input sentence will make the system consider it as a mask to be predicted? ---> YES!

Is it possible to make the system automatically mask a percentage of the input tokens or replace a percentage of tokens with some other token? --> YES! Please look into the options --mask_input, --token_masking_lambda and --token_masking_probs_range. Note that the masking is applied randomly, so you have no fine-grained control over which tokens get masked.
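
For intuition, here is a rough, standalone sketch of the kind of random span masking such options control: a fraction of the tokens is masked, with span lengths drawn from a Poisson distribution as in BART-style noising. This is not YANMTT's actual implementation; the names mask_fraction and span_lambda are hypothetical and only loosely mirror --token_masking_probs_range and --token_masking_lambda.

import numpy as np

# Illustrative BART-style span masking (not YANMTT's internal code).
def mask_sentence(sentence, mask_fraction=0.3, span_lambda=3.5, mask_token="<mask>"):
    tokens = sentence.split()
    budget = max(1, round(mask_fraction * len(tokens)))  # how many tokens we may mask
    out, i = [], 0
    while i < len(tokens):
        if budget > 0 and np.random.rand() < mask_fraction:
            # Replace a whole span (Poisson-distributed length) with a single mask token.
            span = min(budget, max(1, np.random.poisson(span_lambda)))
            out.append(mask_token)
            budget -= span
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(mask_sentence("Mike goes to the bookstore on Thursday"))
# Possible output: "<mask> goes to the bookstore <mask>"

The masked string can then be fed to the model and decoded with beam search, as in the earlier snippet, to obtain variations of the original sentence.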

Finally, it is not completely clear to me how to train a system for my language of interest and then use it to denoise sentences. I know you mention reading the documentation, but is there an easier-to-follow tutorial that can teach me to do that? --> Everything you need to know is here: https://github.com/prajdabre/yanmtt/blob/main/examples/train_mbart_model.sh
If this is insufficient then let me know!

EDIT: Please pull the latest code, since I released V2 last week!