Data Collator Incorrect When Using a Decoder Prefix
seanlgoldberg opened this issue · comments
https://github.com/microsoft/i-Code/blob/main/i-Code-Doc/core/trainers/data_collator.py
Hello!
In the `__call__` of the DataCollator class, the max decoder feature length is determined by the max size of the labels:

```python
max_feature_len_decoder = max([f["labels"].shape[0] for f in features])
```
This makes the `target_len_decoder` variable depend on the label size only. So if you're using a decoder prefix (via `decoder_input_ids`), the sequence gets incorrectly truncated to the label length:

```python
if key in ['decoder_input_ids', 'labels', 'decoder_attention_mask', 'decoder_seg_data']:
    batched_feature = torch.stack([pad_sequence_native(f[key], target_len_decoder, pad_value) for f in features], dim=0)
```
As a result, any decoder prefix longer than the labels gets cut off. This may not matter for UDOP pretraining, but it is very much an issue for something like question-answering fine-tuning.
A better way to calculate the length would be:

```python
max_feature_len_decoder = max([f["labels"].shape[0] + f['decoder_input_ids'].shape[0] for f in features])
```
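To illustrate, here is a minimal sketch of the truncation. The `pad_sequence_native` below is a simplified stand-in (assumed behavior: right-pad or truncate a 1-D tensor to a target length), and the feature dict is a hypothetical QA-style example with a 5-token decoder prefix and 2-token labels:

```python
import torch

# Simplified stand-in for the repo's pad_sequence_native:
# right-pads a 1-D tensor to target_len, or truncates if it is longer.
def pad_sequence_native(seq, target_len, pad_value):
    if seq.shape[0] >= target_len:
        return seq[:target_len]
    pad = torch.full((target_len - seq.shape[0],), pad_value, dtype=seq.dtype)
    return torch.cat([seq, pad])

# Hypothetical QA-style feature: 5-token decoder prefix, but only 2 label tokens.
features = [{"labels": torch.tensor([1, 2]),
             "decoder_input_ids": torch.tensor([10, 11, 12, 13, 14])}]

# Current behavior: target length comes from labels alone.
buggy_len = max(f["labels"].shape[0] for f in features)

# Proposed fix: account for both the labels and the decoder prefix.
fixed_len = max(f["labels"].shape[0] + f["decoder_input_ids"].shape[0]
                for f in features)

# With buggy_len the 5-token prefix is truncated to 2 tokens;
# with fixed_len it survives intact (padded to length 7).
truncated = pad_sequence_native(features[0]["decoder_input_ids"], buggy_len, 0)
intact = pad_sequence_native(features[0]["decoder_input_ids"], fixed_len, 0)
print(truncated.shape[0], intact.shape[0])  # 2 7
```

With the label-only length, three of the five prefix tokens are silently dropped before they ever reach the decoder, which is exactly the failure mode described above.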