OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

Exported OpenNMT models producing unexpected prediction lengths

dmar1n opened this issue

Hi,

I'm training a multilingual TransformerBig with OpenNMT-tf and converting the models into CTranslate2. I use a shared vocabulary of 64k tokens.

The predictions during evaluation and from the checkpoints are fine, and the model exported as a SavedModel also produces the expected outputs.

However, the exported CTranslate2 models always generate up to the maximum decoding length (256) regardless of the length of the input sentence. More precisely, the first tokens of the prediction are correct, but the rest is just random tokens, often repeated many times.

Example:

  • Input sentence: "speech aids for use in voice restoration"
  • Input tokenised: ['<en>', '<es>', 's', 'pe', 'ech', '▁aids', '▁for', '▁use', '▁in', '▁voice', '▁restoration']
  • Output tokenised: ['<es>', 'ayuda', 's', '▁para', '▁la', '▁recuperación', '▁de', '▁voz', 'a', '▁para', '▁su', '▁uso', '▁en', '▁la', '▁restauración', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', '▁para', '▁su', '▁uso', '▁en', '▁la', '▁restauración', 'a', 'a', '▁', 's', 'da', 'da', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'log', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'gre', 'a', 'a', 'ba', 'a', 'a', 'ba', 'za', 'a', 'a', 'ba', 'a', 'a', 'ba', 'a', 'a', 'ba', 'a', 'a', 'ba', 'a', 'a', 'ba', 'a', 'a', 'ba', 'a', 'a', 'ba', 'a', 'a', 'ba', 'a', 'a', 'ba', 'a', 'a', 'ba', 'a', 'a', 'ba', 'za', 'a', 'a', 'ba', 'a', 'a', 'ba', 'a', 'a', 'a', 'a', '▁para', '▁su', 'a', 'a', 'ba', 'za', 'a', 'a', '▁', 's', 'a', 'a', 'ba', 'a', 'a', 'ba', 'za', 'a', 'a', 'ba', 'za', 'a', 'a', 'ba', 'za', 'a', 'a', 'ba', 'za', 'a', 'a', 'ba', 'za', 'a', 'a', 'ba', 'za', 'a', 'a', 'ba', 'za', 'a', 'a', 'cu', 's', 'a', 'pom', 's', 'a', 'pom', 's', 'a', 'pom', 's', 'a', 'pom', 's', 'a', 'pom', 's', 'a', 'a', 'ba']

Versions:

  • CTranslate2 3.20.0
  • OpenNMT-tf 2.32.0
  • tensorflow 2.11.1

(I have tried other versions with the same results.)

The config.json generated by the training process when exporting on best BLEU is the following:

{
  "add_source_bos": false,
  "add_source_eos": false,
  "bos_token": "<s>",
  "decoder_start_token": "<s>",
  "eos_token": "</s>",
  "layer_norm_epsilon": null,
  "unk_token": "<unk>"
}
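Since end-of-sentence detection depends on these tokens lining up with the converted vocabulary, one quick check is whether they are actually present in the CTranslate2 model directory. A minimal sketch, assuming a shared vocabulary and that the file is named shared_vocabulary.json or shared_vocabulary.txt depending on the converter version (the model path and the <en>/<es> tags below are placeholders):

import json
import os

model_dir = "model_ct2"  # hypothetical path to the converted model

def load_ct2_vocab(model_dir):
    # Newer converters write a JSON list, older ones a plain text file
    # with one token per line.
    json_path = os.path.join(model_dir, "shared_vocabulary.json")
    if os.path.exists(json_path):
        with open(json_path, encoding="utf-8") as f:
            return json.load(f)
    txt_path = os.path.join(model_dir, "shared_vocabulary.txt")
    with open(txt_path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

vocab = load_ct2_vocab(model_dir)
print("vocabulary size:", len(vocab))
for token in ["<s>", "</s>", "<unk>", "<en>", "<es>"]:
    print(token, "->", vocab.index(token) if token in vocab else "MISSING")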

The training configuration is mostly the defaults (auto_config enabled). The vocabulary is configured as follows (both the source and target vocabularies point to the same file, as indicated in the OpenNMT-tf docs):

data:
  source_vocabulary: vocab.txt
  target_vocabulary: vocab.txt
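A related sanity check is to confirm that vocab.txt really contains one token per line and roughly the expected 64k entries; a leftover tab-separated score column from a SentencePiece .vocab file, for example, would silently change every token. A minimal sketch, assuming vocab.txt is in the working directory:

# Quick sanity check on the shared vocabulary file (path assumed).
with open("vocab.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print("entries:", len(lines))          # expected to be around 64k
print("first entries:", lines[:5])     # special tokens should appear here
print("lines with a tab:", sum("\t" in line for line in lines))  # should be 0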

The lines used to run the prediction with CT2 are the following:

tokens, _ = self.tokenizer.tokenize(line, training=False)
output = self.translator.translate_batch(
    [tokens],
    beam_size=1,
    batch_type="tokens",
    max_batch_size=4096,
)
yield output[0].hypotheses[0]
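For reference, a standalone version of the same call, without the surrounding class, that can be used to reproduce the behaviour; the model and SentencePiece paths and the language tags are placeholders, and tokenization is done directly with sentencepiece instead of the OpenNMT-tf tokenizer:

import ctranslate2
import sentencepiece as spm

# Paths and language tags below are placeholders.
sp = spm.SentencePieceProcessor(model_file="sp.model")
translator = ctranslate2.Translator("model_ct2", device="cpu")

line = "speech aids for use in voice restoration"
tokens = ["<en>", "<es>"] + sp.encode(line, out_type=str)

result = translator.translate_batch(
    [tokens],
    beam_size=1,
    max_decoding_length=256,  # the default; only reached when </s> is never produced
)[0]
print(result.hypotheses[0])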

Different beam sizes produce similar results.

The problem was in the conversion of the vocabulary from SentencePiece format to OpenNMT-tf format.
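For anyone hitting the same symptom: the SentencePiece .vocab file is tab-separated (token and score) and reserves its own <unk>, <s> and </s> entries, so it cannot be used as-is as the OpenNMT-tf vocabulary. If I remember the OpenNMT-tf tooling correctly, onmt-build-vocab can do the conversion with --from_vocab sp.vocab --from_format sentencepiece --save_vocab vocab.txt. The sketch below only illustrates roughly what such a conversion involves, under the assumption that the OpenNMT-tf vocabulary starts with <blank>, <s>, </s> and leaves <unk> implicit; it is not the actual fix applied here:

def convert_sp_vocab(sp_vocab_path, out_path):
    # SentencePiece writes "token<TAB>score" per line; keep only the token.
    with open(sp_vocab_path, encoding="utf-8") as f:
        tokens = [line.split("\t")[0] for line in f if line.strip()]

    # Assumption: drop SentencePiece's reserved specials and prepend the
    # header OpenNMT-tf expects (<unk> is added implicitly at runtime).
    tokens = [t for t in tokens if t not in ("<unk>", "<s>", "</s>")]

    with open(out_path, "w", encoding="utf-8") as f:
        for token in ["<blank>", "<s>", "</s>"] + tokens:
            f.write(token + "\n")

convert_sp_vocab("sp.vocab", "vocab.txt")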