haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Home Page: https://llava.hliu.cc


[Question] preprocess_plain: no question part

xiaoyudxy opened this issue

Question

Hi authors, thank you for your great contribution!

I've noticed that during the pretraining phase, the preprocess_plain method is used. This method discards the question part and directly concatenates the image token with the answer. Could you explain the rationale behind this approach? Why is the question discarded instead of being retained?

```python
def preprocess_plain(
    sources: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    # add end signal and concatenate together
    conversations = []
    for source in sources:
        assert len(source) == 2
        assert DEFAULT_IMAGE_TOKEN in source[0]['value']
        source[0]['value'] = DEFAULT_IMAGE_TOKEN
        conversation = source[0]['value'] + source[1]['value'] + conversation_lib.default_conversation.sep
        conversations.append(conversation)
    # tokenize conversations
    input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations]
    targets = copy.deepcopy(input_ids)
    for target, source in zip(targets, sources):
        tokenized_len = len(tokenizer_image_token(source[0]['value'], tokenizer))
        target[:tokenized_len] = IGNORE_INDEX

    return dict(input_ids=input_ids, labels=targets)
```
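
To see what this does to a single pretraining sample, here is a minimal sketch with a toy whitespace tokenizer and stand-in constants (the values of DEFAULT_IMAGE_TOKEN, IGNORE_INDEX, and the separator below are illustrative, not the repo's real tokenizer or templates): the human turn collapses to just the image token, and the loss is computed only on the caption tokens.

```python
# Toy illustration of the masking done by preprocess_plain.
# The constants and the whitespace "tokenizer" below are stand-ins for
# illustration only, not LLaVA's actual tokenizer or separator.
DEFAULT_IMAGE_TOKEN = "<image>"
IGNORE_INDEX = -100
SEP = "\n"  # assumed stand-in for conversation_lib.default_conversation.sep

def toy_tokenize(text):
    # crude tokenizer for this sketch: split off the image token, then whitespace
    text = text.replace(DEFAULT_IMAGE_TOKEN, DEFAULT_IMAGE_TOKEN + " ")
    return text.split()

source = [
    {"from": "human", "value": DEFAULT_IMAGE_TOKEN + "\nDescribe the image concisely."},
    {"from": "gpt", "value": "A dog playing with a ball."},
]

# The question text is dropped; only the image token survives as the prompt.
source[0]["value"] = DEFAULT_IMAGE_TOKEN
conversation = source[0]["value"] + source[1]["value"] + SEP

input_ids = toy_tokenize(conversation)
labels = list(input_ids)

# Mask the prompt part (here just the image token) so that only the
# caption tokens contribute to the loss.
prompt_len = len(toy_tokenize(source[0]["value"]))
labels[:prompt_len] = [IGNORE_INDEX] * prompt_len

print(input_ids)  # ['<image>', 'A', 'dog', 'playing', 'with', 'a', 'ball.']
print(labels)     # [-100, 'A', 'dog', 'playing', 'with', 'a', 'ball.']
```

So during pretraining the model only ever sees the image token as the prompt and is supervised to emit the caption; there is no real question to keep.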

+1, I ran into the same question today.

Same question here. I noticed that in the paper, the pretraining stage uses a question X_q. Why does pretraining use the preprocess_plain function?

I have found the answer. See these two links:
#615
https://github.com/haotian-liu/LLaVA/releases/tag/v1.0.1

The author says:

> Pretraining. We simplified the pretraining prompts by removing additional instructions like Describe the image details, which we find to allow the zero-shot inference and can slightly improve the training speed.
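
In other words, the pretraining samples now look roughly like the first format below rather than the second. The exact instruction wording and separators here are illustrative guesses, not the repo's literal templates:

```python
# Illustrative only: instruction wording and separators are assumptions,
# not the exact templates used in the repo.
caption = "A dog playing with a ball."

# Plain pretraining prompt (preprocess_plain): just the image token + caption.
plain_sample = "<image>" + caption + "\n"

# Earlier instruction-style variant: a brief instruction precedes the caption.
instruction_sample = "<image>\nDescribe the image concisely.\n" + caption
```

Dropping the extra instruction makes every pretraining sequence slightly shorter (hence the small training-speed gain the author mentions), while stage-1 training still amounts to aligning image features with the caption text.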