Data Collator Incorrect When Using a Decoder Prefix
seanlgoldberg opened this issue · comments
https://github.com/microsoft/i-Code/blob/main/i-Code-Doc/core/trainers/data_collator.py
Hello!
In the `__call__` of the DataCollator class, the max decoder feature length is determined by the max size of the labels:

```python
max_feature_len_decoder = max([f["labels"].shape[0] for f in features])
```
This makes the `target_len_decoder` variable depend on the label size only. So if you're using a decoder prefix (via `decoder_input_ids`), the sequence gets incorrectly truncated to the label length:

```python
if key in ['decoder_input_ids', 'labels', 'decoder_attention_mask', 'decoder_seg_data']:
    batched_feature = torch.stack([pad_sequence_native(f[key], target_len_decoder, pad_value) for f in features], dim=0)
```
As a result, any decoder prefix longer than the labels gets cut off. This may not matter for UDOP pretraining, but it is very much an issue for something like question-answering fine-tuning.
A better way to calculate the length would be:

```python
max_feature_len_decoder = max([f["labels"].shape[0] + f['decoder_input_ids'].shape[0] for f in features])
```
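To illustrate, here is a minimal sketch of the truncation. The `pad_sequence_native` below is a simplified stand-in (assumed behavior: right-pad or truncate a 1-D tensor to a target length), and the feature dict is a hypothetical QA-style example with a 5-token decoder prefix and 2-token labels:

```python
import torch

# Simplified stand-in for the repo's pad_sequence_native:
# right-pads a 1-D tensor to target_len, or truncates if it is longer.
def pad_sequence_native(seq, target_len, pad_value):
    if seq.shape[0] >= target_len:
        return seq[:target_len]
    pad = torch.full((target_len - seq.shape[0],), pad_value, dtype=seq.dtype)
    return torch.cat([seq, pad])

# Hypothetical QA-style feature: 5-token decoder prefix, but only 2 label tokens.
features = [{"labels": torch.tensor([1, 2]),
             "decoder_input_ids": torch.tensor([10, 11, 12, 13, 14])}]

# Current behavior: target length comes from labels alone.
buggy_len = max(f["labels"].shape[0] for f in features)

# Proposed fix: account for both the labels and the decoder prefix.
fixed_len = max(f["labels"].shape[0] + f["decoder_input_ids"].shape[0]
                for f in features)

# With buggy_len the 5-token prefix is truncated to 2 tokens;
# with fixed_len it survives intact (padded to length 7).
truncated = pad_sequence_native(features[0]["decoder_input_ids"], buggy_len, 0)
intact = pad_sequence_native(features[0]["decoder_input_ids"], fixed_len, 0)
print(truncated.shape[0], intact.shape[0])  # 2 7
```

With the label-only length, three of the five prefix tokens are silently dropped before they ever reach the decoder, which is exactly the failure mode described above.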