haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Home Page: https://llava.hliu.cc


[Question] preprocess_plain: no question part

xiaoyudxy opened this issue

Question

Hi authors, thank you for your great contribution!

I've noticed that during the pretraining phase, the preprocess_plain method is used. This method discards the question part and directly concatenates the image token with the answer. Could you explain the rationale behind this approach? Why is the question discarded instead of being retained?

```python
def preprocess_plain(
    sources: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    # add end signal and concatenate together
    conversations = []
    for source in sources:
        assert len(source) == 2
        assert DEFAULT_IMAGE_TOKEN in source[0]['value']
        source[0]['value'] = DEFAULT_IMAGE_TOKEN
        conversation = source[0]['value'] + source[1]['value'] + conversation_lib.default_conversation.sep
        conversations.append(conversation)
    # tokenize conversations
    input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations]
    targets = copy.deepcopy(input_ids)
    for target, source in zip(targets, sources):
        tokenized_len = len(tokenizer_image_token(source[0]['value'], tokenizer))
        target[:tokenized_len] = IGNORE_INDEX

    return dict(input_ids=input_ids, labels=targets)
```
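
To see what this does to a single pretraining sample, here is a minimal sketch with a toy whitespace tokenizer and stand-in constants (the values of DEFAULT_IMAGE_TOKEN, IGNORE_INDEX, and the separator below are illustrative, not the repo's real tokenizer or templates): the human turn collapses to just the image token, and the loss is computed only on the caption tokens.

```python
# Toy illustration of the masking done by preprocess_plain.
# The constants and the whitespace "tokenizer" below are stand-ins for
# illustration only, not LLaVA's actual tokenizer or separator.
DEFAULT_IMAGE_TOKEN = "<image>"
IGNORE_INDEX = -100
SEP = "\n"  # assumed stand-in for conversation_lib.default_conversation.sep

def toy_tokenize(text):
    # crude tokenizer for this sketch: split off the image token, then whitespace
    text = text.replace(DEFAULT_IMAGE_TOKEN, DEFAULT_IMAGE_TOKEN + " ")
    return text.split()

source = [
    {"from": "human", "value": DEFAULT_IMAGE_TOKEN + "\nDescribe the image concisely."},
    {"from": "gpt", "value": "A dog playing with a ball."},
]

# The question text is dropped; only the image token survives as the prompt.
source[0]["value"] = DEFAULT_IMAGE_TOKEN
conversation = source[0]["value"] + source[1]["value"] + SEP

input_ids = toy_tokenize(conversation)
labels = list(input_ids)

# Mask the prompt part (here just the image token) so that only the
# caption tokens contribute to the loss.
prompt_len = len(toy_tokenize(source[0]["value"]))
labels[:prompt_len] = [IGNORE_INDEX] * prompt_len

print(input_ids)  # ['<image>', 'A', 'dog', 'playing', 'with', 'a', 'ball.']
print(labels)     # [-100, 'A', 'dog', 'playing', 'with', 'a', 'ball.']
```

So during pretraining the model only ever sees the image token as the prompt and is supervised to emit the caption; there is no real question to keep.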

+1, I ran into the same question today.

Same question here. I noticed that in the paper, the pretraining stage uses a question X_q. Why does pretraining use the preprocess_plain function?

I have found the answer. See these two links:
#615
https://github.com/haotian-liu/LLaVA/releases/tag/v1.0.1

The author says:

> Pretraining. We simplified the pretraining prompts by removing additional instructions like Describe the image details, which we find to allow the zero-shot inference and can slightly improve the training speed.
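
In other words, the pretraining samples now look roughly like the first format below rather than the second. The exact instruction wording and separators here are illustrative guesses, not the repo's literal templates:

```python
# Illustrative only: instruction wording and separators are assumptions,
# not the exact templates used in the repo.
caption = "A dog playing with a ball."

# Plain pretraining prompt (preprocess_plain): just the image token + caption.
plain_sample = "<image>" + caption + "\n"

# Earlier instruction-style variant: a brief instruction precedes the caption.
instruction_sample = "<image>\nDescribe the image concisely.\n" + caption
```

Dropping the extra instruction makes every pretraining sequence slightly shorter (hence the small training-speed gain the author mentions), while stage-1 training still amounts to aligning image features with the caption text.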