OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

Can we use in-context multimodal data for finetuning?

waltonfuture opened this issue · comments

Thanks for your great work! However, it seems that we can only use data that contains one image for SFT. Can we use in-context multimodal data (i.e., containing multiple images) for finetuning?

yes, the code supports multi-image finetuning

Thank you. How should I organize my data for multi-image SFT? And how do I run inference with multiple images?
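For what it's worth, here is a minimal sketch of what a multi-image SFT sample could look like. This is an assumption, not confirmed by the maintainers: it follows the general shape of the repo's single-image finetuning JSON (an `"image"` field plus a `"conversations"` list), with hypothetical `<image_00>`/`<image_01>` placeholder keys mapping to image paths and referenced in the prompt text. Check the official finetune docs before using this layout.

```python
import json

# Hypothetical multi-image SFT sample (schema is an assumption, see above).
# The "image" field maps placeholder tags to image paths; the user turn
# references each tag so the loader knows where each image belongs.
sample = {
    "id": "0",
    "image": {
        "<image_00>": "path/to/first.jpg",   # assumed placeholder keys
        "<image_01>": "path/to/second.jpg",
    },
    "conversations": [
        {
            "role": "user",
            "content": "<image_00>\n<image_01>\nCompare the two images.",
        },
        {"role": "assistant", "content": "The first image shows ..."},
    ],
}

# Sanity-check: every declared image placeholder appears in the prompt.
prompt = sample["conversations"][0]["content"]
assert all(tag in prompt for tag in sample["image"])

# The training file would be a JSON list of such samples.
print(len(json.dumps([sample])) > 0)
```

The key design point is that each image gets an explicit placeholder in the text, so the order and position of images in the prompt is unambiguous.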

Same problem here. Any update on multi-image sft?

@qyc-98 Hello! Can you provide some simple examples of in-context inference or SFT? Thanks a lot!

@qyc-98 I have encountered the same problem. Have you resolved it?
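For multi-image inference, a plausible shape of the request is shown below. This is a sketch under an assumption: newer MiniCPM-V releases document passing several PIL images inside a single user turn via `msgs` (with `image=None`), but whether MiniCPM-Llama3-V 2.5 accepts this should be verified against the model card. The `model.chat(...)` call is left commented out since it needs loaded weights.

```python
from PIL import Image

# Stand-in images; in practice these would be Image.open("...") on real files.
image1 = Image.new("RGB", (448, 448), "white")
image2 = Image.new("RGB", (448, 448), "black")

# Assumed message format: images listed first in the content list,
# followed by the text question, all inside one user turn.
msgs = [
    {
        "role": "user",
        "content": [image1, image2, "What differs between these two images?"],
    }
]

# The actual call would look roughly like this (requires model + tokenizer):
# answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)

# Count how many images the turn carries.
n_images = sum(isinstance(part, Image.Image) for part in msgs[0]["content"])
print(n_images)  # 2
```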