beccabai / Data-centric_multimodal_LLM

Survey on Data-centric Large Language Models


Data-centric Multimodal LLM

Survey on data-centric multimodal large language models

Paper

Sources

List of Sources

Source Name Source Link Type
CommonCrawl https://commoncrawl.org/ Common Webpages
Flickr https://www.flickr.com/ Common Webpages
Flickr Video https://www.flickr.com/photos/tags/vídeo/ Common Webpages
FreeSound https://freesound.org Common Webpages
BBC Sound Effects https://sound-effects.bbcrewind.co.uk Common Webpages
SoundBible https://soundbible.com/ Common Webpages
Wikipedia https://www.wikipedia.org/ Wikipedia
Wikimedia Commons https://commons.wikimedia.org/ Wikipedia
Stack Exchange https://stackexchange.com/ Social Media
Reddit https://www.reddit.com/ Social Media
Ubuntu IRC https://irclogs.ubuntu.com/ Social Media
Youtube https://www.youtube.com Social Media
X https://x.com Social Media
S2ORC https://github.com/allenai/s2orc Academic Papers
Arxiv https://arxiv.org/ Academic Papers
Project Gutenberg https://www.gutenberg.org Books
Smashwords https://www.smashwords.com/ Books
Bibliotik https://bibliotik.me/ Books
National Diet Library https://dl.ndl.go.jp/ja/photo Books
BigQuery public dataset https://cloud.google.com/bigquery/public-data Code
GitHub https://github.com/ Code
FreeLaw https://free.law/ Legal
Chinese legal documents https://www.spp.gov.cn/spp/fl/ Legal
Khan Academy exercises https://www.khanacademy.org Maths
MEDLINE www.medline.com Medical
Patient https://patient.info Medical
WebMD https://www.webmd.com Medical
NIH https://www.nih.gov/ Medical
39 Ask Doctor https://ask.39.net/ Medical
Medical Exams https://drive.google.com/file/d/1ImYUSLk9JbgHXOemfvyiDiirluZHPeQw/view Medical
Baidu Doctor https://muzhi.baidu.com/ Medical
120 Asks https://www.120ask.com/ Medical
BMJ Case Reports https://casereports.bmj.com Medical
XYWY http://www.xywy.com Medical
Qianwen Health https://51zyzy.com Medical
PubMed https://pubmed.ncbi.nlm.nih.gov Medical
EDGAR https://www.sec.gov/edgar Financial
SEC Financial Statement and Notes Data Sets https://www.sec.gov/dera/data/financial-statement-and-notes-data-set Financial
Sina Finance https://finance.sina.com.cn/ Financial
Tencent Finance https://new.qq.com/ch/finance/ Financial
Eastmoney https://www.eastmoney.com/ Financial
Guba https://guba.eastmoney.com/ Financial
Xueqiu https://xueqiu.com/ Financial
Phoenix Finance https://finance.ifeng.com/ Financial
36Kr https://36kr.com/ Financial
Huxiu https://www.huxiu.com/ Financial

Commonly-used datasets

Textual-Pretraining Datasets:

Datasets Link
RedPajama-Data-1T https://www.together.ai/blog/redpajama
RedPajama-Data-v2 https://www.together.ai/blog/redpajama-data-v2
SlimPajama https://huggingface.co/datasets/cerebras/SlimPajama-627B
Falcon-RefinedWeb https://huggingface.co/datasets/tiiuae/falcon-refinedweb
Pile https://github.com/EleutherAI/the-pile?tab=readme-ov-file
ROOTS https://huggingface.co/bigscience-data
WuDaoCorpora https://data.baai.ac.cn/details/WuDaoCorporaText
Common Crawl https://commoncrawl.org/
C4 https://huggingface.co/datasets/c4
mC4 https://arxiv.org/pdf/2010.11934.pdf
Dolma Dataset https://github.com/allenai/dolma
OSCAR-22.01 https://oscar-project.github.io/documentation/versions/oscar-2201/
OSCAR-23.01 https://huggingface.co/datasets/oscar-corpus/OSCAR-2301
colossal-oscar-1.0 https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0
Wiki40b https://www.tensorflow.org/datasets/catalog/wiki40b
Pushshift Reddit Dataset https://paperswithcode.com/dataset/pushshift-reddit
OpenWebTextCorpus https://paperswithcode.com/dataset/openwebtext
OpenWebText2 https://openwebtext2.readthedocs.io/en/latest/
BookCorpus https://huggingface.co/datasets/bookcorpus
Gutenberg https://shibamoulilahiri.github.io/gutenberg_dataset.html
CC-Stories-R https://paperswithcode.com/dataset/cc-stories
CC-NEWS https://huggingface.co/datasets/cc_news
REALNEWS https://paperswithcode.com/dataset/realnews
Reddit submission dataset https://www.philippsinger.info/reddit/
General Reddit Dataset https://www.tensorflow.org/datasets/catalog/reddit
AMPS https://drive.google.com/file/d/1hQsua3TkpEmcJD_UWQx8dmNdEZPyxw23/view

MM-Pretraining Datasets:

Dataset Name Paper Title (with hyperlink) Modality
ALIGN Scaling up visual and vision-language representation learning with noisy text supervision Image
LTIP Flamingo: a visual language model for few-shot learning Image
MS-COCO Microsoft coco: Common objects in context Image
Visual Genome Visual genome: Connecting language and vision using crowdsourced dense image annotations Image
CC3M Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning Image
CC12M Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts Image
SBU Im2text: Describing images using 1 million captioned photographs Image
LAION-5B Laion-5b: An open large-scale dataset for training next generation image-text models Image
LAION-400M Laion-400m: Open dataset of clip-filtered 400 million image-text pairs Image
LAION-COCO Laion-coco: In the style of MS COCO Image
Flickr30k From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions Image
AI Challenger Ai challenger: A large-scale dataset for going deeper in image understanding Image
COYO COYO-700M: Image-Text Pair Dataset Image
Wukong Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark Image
COCO Caption Microsoft coco captions: Data collection and evaluation server Image
WebLI Pali: A jointly-scaled multilingual language-image model Image
Episodic WebLI Pali-x: On scaling up a multilingual vision and language model Image
CC595k Visual instruction tuning Image
ReferItGame Referitgame: Referring to objects in photographs of natural scenes Image
RefCOCO&RefCOCO+ Modeling context in referring expressions Image
Visual-7W Visual7w: Grounded question answering in images Image
OCR-VQA Ocr-vqa: Visual question answering by reading text in images Image
ST-VQA Scene text visual question answering Image
DocVQA Docvqa: A dataset for vqa on document images Image
TextVQA Towards vqa models that can read Image
DataComp Datacomp: In search of the next generation of multimodal datasets Image
GQA Gqa: A new dataset for real-world visual reasoning and compositional question answering Image
VQA VQA: Visual Question Answering Image
VQAv2 Making the v in vqa matter: Elevating the role of image understanding in visual question answering Image
DVQA Dvqa: Understanding data visualizations via question answering Image
A-OK-VQA A-okvqa: A benchmark for visual question answering using world knowledge Image
Text Captions Textcaps: a dataset for image captioning with reading comprehension Image
M3W Flamingo: a visual language model for few-shot learning Image
MMC4 Multimodal c4: An open, billion-scale corpus of images interleaved with text Image
MSRVTT Msr-vtt: A large video description dataset for bridging video and language Video
WebVid-2M Frozen in time: A joint video and image encoder for end-to-end retrieval Video
VTP Flamingo: a visual language model for few-shot learning Video
AISHELL-1 Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline Audio
AISHELL-2 Aishell-2: Transforming mandarin asr research into industrial scale Audio
WavCaps Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research Audio
VisDial Visual dialog Image
VSDial-CN X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages Image, Audio
MELON Audio Retrieval for Multimodal Design Documents: A New Dataset and Algorithms Image, Text, Audio

Common Textual SFT Datasets:

Dataset Name Language Construction Method Github Link Paper Link Dataset Link
databricks-dolly-15K EN HG https://huggingface.co/datasets/databricks/databricks-dolly-15k
InstructionWild_v2 EN & ZH HG https://github.com/XueFuzhao/InstructionWild
LCCC ZH HG https://github.com/thu-coai/CDial-GPT https://arxiv.org/pdf/2008.03946.pdf
OASST1 Multi (35) HG https://github.com/LAION-AI/Open-Assistant https://arxiv.org/abs/2304.07327 https://huggingface.co/datasets/OpenAssistant/oasst1
OL-CC ZH HG https://data.baai.ac.cn/details/OL-CC
Zhihu-KOL ZH HG https://github.com/wangrui6/Zhihu-KOL https://huggingface.co/datasets/wangrui6/Zhihu-KOL
Aya Dataset Multi (65) HG https://arxiv.org/abs/2402.06619 https://hf.co/datasets/CohereForAI/aya_dataset
InstructIE EN & ZH HG https://github.com/zjunlp/KnowLM https://arxiv.org/abs/2305.11527 https://huggingface.co/datasets/zjunlp/InstructIE
Alpaca_data EN MC https://github.com/tatsu-lab/stanford_alpaca#data-release
BELLE_Generated_Chat ZH MC https://github.com/LianjiaTech/BELLE/tree/main/data/10M https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M
BELLE_Multiturn_Chat ZH MC https://github.com/LianjiaTech/BELLE/tree/main/data/10M https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M
BELLE_train_0.5M_CN ZH MC https://github.com/LianjiaTech/BELLE/tree/main/data/1.5M https://huggingface.co/datasets/BelleGroup/train_0.5M_CN
BELLE_train_1M_CN ZH MC https://github.com/LianjiaTech/BELLE/tree/main/data/1.5M https://huggingface.co/datasets/BelleGroup/train_1M_CN
BELLE_train_2M_CN ZH MC https://github.com/LianjiaTech/BELLE/tree/main/data/10M https://huggingface.co/datasets/BelleGroup/train_2M_CN
BELLE_train_3.5M_CN ZH MC https://github.com/LianjiaTech/BELLE/tree/main/data/10M https://huggingface.co/datasets/BelleGroup/train_3.5M_CN
CAMEL Multi & PL MC https://github.com/camel-ai/camel https://arxiv.org/pdf/2303.17760.pdf https://huggingface.co/camel-ai
Chatgpt_corpus ZH MC https://github.com/PlexPt/chatgpt-corpus/releases/tag/3
InstructionWild_v1 EN & ZH MC https://github.com/XueFuzhao/InstructionWild
LMSYS-Chat-1M Multi MC https://arxiv.org/pdf/2309.11998.pdf https://huggingface.co/datasets/lmsys/lmsys-chat-1m
MOSS_002_sft_data EN & ZH MC https://github.com/OpenLMLab/MOSS https://huggingface.co/datasets/fnlp/moss-002-sft-data
MOSS_003_sft_data EN & ZH MC https://github.com/OpenLMLab/MOSS
MOSS_003_sft_plugin_data EN & ZH MC https://github.com/OpenLMLab/MOSS
OpenChat EN MC https://github.com/imoneoi/openchat https://arxiv.org/pdf/2309.11235.pdf https://huggingface.co/openchat
RedGPT-Dataset-V1-CN ZH MC https://github.com/DA-southampton/RedGPT
Self-Instruct EN MC https://github.com/yizhongw/self-instruct https://aclanthology.org/2023.acl-long.754.pdf
ShareChat Multi MC
ShareGPT-Chinese-English-90k EN & ZH MC https://github.com/CrazyBoyM/llama2-Chinese-chat https://huggingface.co/datasets/shareAI/ShareGPT-Chinese-English-90k
ShareGPT90K EN MC https://huggingface.co/datasets/RyokoAI/ShareGPT52K
UltraChat EN MC https://github.com/thunlp/UltraChat#UltraLM https://arxiv.org/pdf/2305.14233.pdf
Unnatural EN MC https://github.com/orhonovich/unnatural-instructions https://aclanthology.org/2023.acl-long.806.pdf
WebGLM-QA EN MC https://github.com/THUDM/WebGLM https://arxiv.org/pdf/2306.07906.pdf https://huggingface.co/datasets/THUDM/webglm-qa
Wizard_evol_instruct_196K EN MC https://github.com/nlpxucan/WizardLM https://arxiv.org/pdf/2304.12244.pdf https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
Wizard_evol_instruct_70K EN MC https://github.com/nlpxucan/WizardLM https://arxiv.org/pdf/2304.12244.pdf https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
CrossFit EN CI https://github.com/INK-USC/CrossFit https://arxiv.org/pdf/2104.08835.pdf
DialogStudio EN CI https://github.com/salesforce/DialogStudio https://arxiv.org/pdf/2307.10172.pdf https://huggingface.co/datasets/Salesforce/dialogstudio
Dynosaur EN CI https://github.com/WadeYin9712/Dynosaur https://arxiv.org/pdf/2305.14327.pdf https://huggingface.co/datasets?search=dynosaur
Flan-mini EN CI https://github.com/declare-lab/flacuna https://arxiv.org/pdf/2307.02053.pdf https://huggingface.co/datasets/declare-lab/flan-mini
Flan Multi CI https://github.com/google-research/flan https://arxiv.org/pdf/2109.01652.pdf
Flan 2022 Multi CI https://github.com/google-research/FLAN/tree/main/flan/v2 https://arxiv.org/pdf/2301.13688.pdf https://huggingface.co/datasets/SirNeural/flan_v2
InstructDial EN CI https://github.com/prakharguptaz/Instructdial https://arxiv.org/pdf/2205.12673.pdf
NATURAL INSTRUCTIONS EN CI https://github.com/allenai/natural-instructions https://aclanthology.org/2022.acl-long.244.pdf https://instructions.apps.allenai.org/
OIG EN CI https://huggingface.co/datasets/laion/OIG
Open-Platypus EN CI https://github.com/arielnlee/Platypus https://arxiv.org/pdf/2308.07317.pdf https://huggingface.co/datasets/garage-bAInd/Open-Platypus
OPT-IML Multi CI https://github.com/facebookresearch/metaseq https://arxiv.org/pdf/2212.12017.pdf
PromptSource EN CI https://github.com/bigscience-workshop/promptsource https://aclanthology.org/2022.acl-demo.9.pdf
SUPER-NATURAL INSTRUCTIONS Multi CI https://github.com/allenai/natural-instructions https://arxiv.org/pdf/2204.07705.pdf
T0 EN CI https://arxiv.org/pdf/2110.08207.pdf
UnifiedSKG EN CI https://github.com/xlang-ai/UnifiedSKG https://arxiv.org/pdf/2201.05966.pdf
xP3 Multi (46) CI https://github.com/bigscience-workshop/xmtf https://aclanthology.org/2023.acl-long.891.pdf
IEPile EN & ZH CI https://github.com/zjunlp/IEPile https://arxiv.org/abs/2402.14710 https://huggingface.co/datasets/zjunlp/iepile
Firefly ZH HG & CI https://github.com/yangjianxin1/Firefly https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M
LIMA-sft EN HG & CI https://arxiv.org/pdf/2305.11206.pdf https://huggingface.co/datasets/GAIR/lima
COIG-CQIA ZH HG & CI https://arxiv.org/abs/2403.18058 https://huggingface.co/datasets/m-a-p/COIG-CQIA
InstructGPT-sft EN HG & MC https://arxiv.org/pdf/2203.02155.pdf
Alpaca_GPT4_data EN CI & MC https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#data-release https://arxiv.org/pdf/2304.03277.pdf
Alpaca_GPT4_data_zh ZH CI & MC https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#data-release https://huggingface.co/datasets/shibing624/alpaca-zh
Bactrain-X Multi (52) CI & MC https://github.com/mbzuai-nlp/bactrian-x https://arxiv.org/pdf/2305.15011.pdf https://huggingface.co/datasets/MBZUAI/Bactrian-X
Baize EN CI & MC https://github.com/project-baize/baize-chatbot https://arxiv.org/pdf/2304.01196.pdf https://github.com/project-baize/baize-chatbot/tree/main/data
GPT4All EN CI & MC https://github.com/nomic-ai/gpt4all https://gpt4all.io/reports/GPT4All_Technical_Report_3.pdf https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPT4all
GuanacoDataset Multi CI & MC https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
LaMini-LM EN CI & MC https://github.com/mbzuai-nlp/LaMini-LM https://arxiv.org/pdf/2304.14402.pdf https://huggingface.co/datasets/MBZUAI/LaMini-instruction
LogiCoT EN & ZH CI & MC https://github.com/csitfun/logicot https://arxiv.org/pdf/2305.12147.pdf https://huggingface.co/datasets/csitfun/LogiCoT
LongForm EN CI & MC https://github.com/akoksal/LongForm https://arxiv.org/pdf/2304.08460.pdf https://huggingface.co/datasets/akoksal/LongForm
Luotuo-QA-B EN & ZH CI & MC https://github.com/LC1332/Luotuo-QA https://huggingface.co/datasets/Logic123456789/Luotuo-QA-B
OpenOrca Multi CI & MC https://arxiv.org/pdf/2306.02707.pdf https://huggingface.co/datasets/Open-Orca/OpenOrca
Wizard_evol_instruct_zh ZH CI & MC https://github.com/LC1332/Chinese-alpaca-lora https://huggingface.co/datasets/silk-road/Wizard-LM-Chinese-instruct-evol
COIG ZH HG & CI & MC https://github.com/FlagOpen/FlagInstruct https://arxiv.org/pdf/2304.07987.pdf https://huggingface.co/datasets/BAAI/COIG
HC3 EN & ZH HG & CI & MC https://github.com/Hello-SimpleAI/chatgpt-comparison-detection https://arxiv.org/pdf/2301.07597.pdf
Phoenix-sft-data-v1 Multi HG & CI & MC https://github.com/FreedomIntelligence/LLMZoo https://arxiv.org/pdf/2304.10453.pdf https://huggingface.co/datasets/FreedomIntelligence/phoenix-sft-data-v1
TigerBot_sft_en EN HG & CI & MC https://github.com/TigerResearch/TigerBot https://arxiv.org/abs/2312.08688 https://huggingface.co/datasets/TigerResearch/sft_en
TigerBot_sft_zh ZH HG & CI & MC https://github.com/TigerResearch/TigerBot https://arxiv.org/abs/2312.08688 https://huggingface.co/datasets/TigerResearch/sft_zh
Aya Collection Multi (114) HG & CI & MC https://arxiv.org/abs/2402.06619 https://hf.co/datasets/CohereForAI/aya_collection

Domain Specific Textual SFT Datasets:

Dataset Name Language Domain Construction Method Github Link Paper Link Dataset Link
ChatDoctor EN Medical HG & MC https://github.com/Kent0n-Li/ChatDoctor https://arxiv.org/ftp/arxiv/papers/2303/2303.14070.pdf
ChatMed_Consult_Dataset ZH Medical MC https://github.com/michael-wzhu/ChatMed https://huggingface.co/datasets/michaelwzhu/ChatMed_Consult_Dataset
CMtMedQA ZH Medical HG https://github.com/SupritYoung/Zhongjing https://arxiv.org/pdf/2308.03549.pdf https://huggingface.co/datasets/Suprit/CMtMedQA
DISC-Med-SFT ZH Medical HG & CI https://github.com/FudanDISC/DISC-MedLLM https://arxiv.org/pdf/2308.14346.pdf https://huggingface.co/datasets/Flmc/DISC-Med-SFT
HuatuoGPT-sft-data-v1 ZH Medical HG & MC https://github.com/FreedomIntelligence/HuatuoGPT https://arxiv.org/pdf/2305.15075.pdf https://huggingface.co/datasets/FreedomIntelligence/HuatuoGPT-sft-data-v1
Huatuo-26M ZH Medical CI https://github.com/FreedomIntelligence/Huatuo-26M https://arxiv.org/pdf/2305.01526.pdf
MedDialog EN & ZH Medical HG https://github.com/UCSD-AI4H/Medical-Dialogue-System https://aclanthology.org/2020.emnlp-main.743.pdf
Medical Meadow EN Medical HG & CI https://github.com/kbressem/medAlpaca https://arxiv.org/pdf/2304.08247.pdf https://huggingface.co/medalpaca
Medical-sft EN & ZH Medical CI https://github.com/shibing624/MedicalGPT https://huggingface.co/datasets/shibing624/medical
QiZhenGPT-sft-20k ZH Medical CI https://github.com/CMKRG/QiZhenGPT
ShenNong_TCM_Dataset ZH Medical MC https://github.com/michael-wzhu/ShenNong-TCM-LLM https://huggingface.co/datasets/michaelwzhu/ShenNong_TCM_Dataset
Code_Alpaca_20K EN & PL Code MC https://github.com/sahil280114/codealpaca
CodeContest EN & PL Code CI https://github.com/google-deepmind/code_contests https://arxiv.org/pdf/2203.07814.pdf
CommitPackFT EN & PL (277) Code HG https://github.com/bigcode-project/octopack https://arxiv.org/pdf/2308.07124.pdf https://huggingface.co/datasets/bigcode/commitpackft
ToolAlpaca EN & PL Code HG & MC https://github.com/tangqiaoyu/ToolAlpaca https://arxiv.org/pdf/2306.05301.pdf
ToolBench EN & PL Code HG & MC https://github.com/OpenBMB/ToolBench https://arxiv.org/pdf/2307.16789v2.pdf
DISC-Law-SFT ZH Law HG & CI & MC https://github.com/FudanDISC/DISC-LawLLM https://arxiv.org/pdf/2309.11325.pdf
HanFei 1.0 ZH Law - https://github.com/siat-nlp/HanFei
LawGPT_zh ZH Law CI & MC https://github.com/LiuHC0428/LAW-GPT
Lawyer LLaMA_sft ZH Law CI & MC https://github.com/AndrewZhe/lawyer-llama https://arxiv.org/pdf/2305.15062.pdf https://github.com/AndrewZhe/lawyer-llama/tree/main/data
BELLE_School_Math ZH Math MC https://github.com/LianjiaTech/BELLE/tree/main/data/10M https://huggingface.co/datasets/BelleGroup/school_math_0.25M
Goat EN Math HG https://github.com/liutiedong/goat https://arxiv.org/pdf/2305.14201.pdf https://huggingface.co/datasets/tiedong/goat
MWP EN & ZH Math CI https://github.com/LYH-YF/MWPToolkit https://browse.arxiv.org/pdf/2109.00799.pdf https://huggingface.co/datasets/Macropodus/MWP-Instruct
OpenMathInstruct-1 EN Math CI & MC https://github.com/Kipok/NeMo-Skills https://arxiv.org/abs/2402.10176 https://huggingface.co/datasets/nvidia/OpenMathInstruct-1
Child_chat_data ZH Education HG & MC https://github.com/HIT-SCIR-SC/QiaoBan
Educhat-sft-002-data-osm EN & ZH Education CI https://github.com/icalk-nlp/EduChat https://arxiv.org/pdf/2308.02773.pdf https://huggingface.co/datasets/ecnu-icalk/educhat-sft-002-data-osm
TaoLi_data ZH Education HG & CI https://github.com/blcuicall/taoli
DISC-Fin-SFT ZH Financial HG & CI & MC https://github.com/FudanDISC/DISC-FinLLM http://arxiv.org/abs/2310.15205
AlphaFin EN & ZH Financial HG & CI & MC https://github.com/AlphaFin-proj/AlphaFin https://arxiv.org/abs/2403.12582 https://huggingface.co/datasets/AlphaFin/AlphaFin-dataset-v1
GeoSignal EN Geoscience HG & CI & MC https://github.com/davendw49/k2 https://arxiv.org/pdf/2306.05064.pdf https://huggingface.co/datasets/daven3/geosignal
MeChat ZH Mental Health CI & MC https://github.com/qiuhuachuan/smile https://arxiv.org/pdf/2305.00450.pdf https://github.com/qiuhuachuan/smile/tree/main/data
Mol-Instructions EN Biology HG & CI & MC https://github.com/zjunlp/Mol-Instructions https://arxiv.org/pdf/2306.08018.pdf https://huggingface.co/datasets/zjunlp/Mol-Instructions
Owl-Instruction EN & ZH IT HG & MC https://github.com/HC-Guo/Owl https://arxiv.org/pdf/2309.09298.pdf
PROSOCIALDIALOG EN Social Norms HG & MC https://arxiv.org/pdf/2205.12688.pdf https://huggingface.co/datasets/allenai/prosocial-dialog
TransGPT-sft ZH Transportation HG https://github.com/DUOMO/TransGPT https://huggingface.co/datasets/DUOMO-Lab/TransGPT-sft

Multimodal SFT Datasets:

Dataset Name Modality Link
LRV-Instruction Image https://huggingface.co/datasets/VictorSanh/LrvInstruction
Clotho-Detail Audio https://github.com/magic-research/bubogpt/blob/main/dataset/README.md#audio-dataset-instruction
CogVLM-SFT-311K Image https://huggingface.co/datasets/THUDM/CogVLM-SFT-311K
ComVint Image https://drive.google.com/file/d/1eH5t8YoI2CGR2dTqZO0ETWpBukjcZWsd/view
DataEngine-InstData Image https://opendatalab.com/OpenDataLab/DataEngine-InstData
GranD_f Image https://huggingface.co/datasets/MBZUAI/GranD-f/tree/main
LLaVA Image https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
LLaVA-1.5 Image https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json
LVLM_NLF Image https://huggingface.co/datasets/YangyiYY/LVLM_NLF/tree/main
M3IT Image https://huggingface.co/datasets/MMInstruction/M3IT
MMC-Instruction Dataset Image https://github.com/FuxiaoLiu/MMC/blob/main/README.md
MiniGPT-4 Image https://drive.google.com/file/d/1nJXhoEcy3KTExr17I7BXqY5Y9Lx_-n-9/view
MiniGPT-v2 Image https://github.com/Vision-CAIR/MiniGPT-4/blob/main/dataset/README_MINIGPTv2_FINETUNE.md
PVIT Image https://huggingface.co/datasets/PVIT/pvit_data_stage2/tree/main
PointLLM Instruction data 3D https://huggingface.co/datasets/RunsenXu/PointLLM/tree/main
ShareGPT4V Image https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/tree/main
Shikra-RD Image https://drive.google.com/file/d/1CNLu1zJKPtliQEYCZlZ8ykH00ppInnyN/view
SparklesDialogue Image https://github.com/HYPJUDY/Sparkles/tree/main/dataset
T2M Image,Video,Audio https://github.com/NExT-GPT/NExT-GPT/tree/main/data/IT_data/T-T+X_data
TextBind Image https://drive.google.com/drive/folders/1-SkzQRInSfrVyZeB0EZJzpCPXXwHb27W
TextMonkey Image https://www.modelscope.cn/datasets/lvskiller/TextMonkey_data/files
VGGSS-Instruction Image,Audio https://bubo-gpt.github.io/
VIGC-InstData Image https://opendatalab.com/OpenDataLab/VIGC-InstData
VILA Image https://github.com/Efficient-Large-Model/VILA/tree/main/data_prepare
VLSafe Image https://arxiv.org/abs/2312.07533
Video-ChatGPT-video-it-data Video https://github.com/mbzuai-oryx/Video-ChatGPT
VideoChat-video-it-data Video https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data
X-InstructBLIP-it-data Image,Video,Audio https://github.com/salesforce/LAVIS/tree/main/projects/xinstructblip

Data-centric pretraining

Domain mixture

  1. Doremi: Optimizing data mixtures speeds up language model pretraining - paper
  2. Data selection for language models via importance resampling - paper
  3. Glam: Efficient scaling of language models with mixture-of-experts - paper
  4. Videollm: Modeling video sequence with large language models - paper
  5. Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset - paper
  6. Moviechat: From dense token to sparse memory for long video understanding - paper
  7. Internvid: A large-scale video-text dataset for multimodal understanding and generation - paper
  8. Youku-mplug: A 10 million large-scale chinese video-language dataset for pre-training and benchmarks - paper
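
The domain-mixture line of work above (e.g. DoReMi) reweights pretraining domains by how far a proxy model's loss exceeds a reference model's. A minimal sketch of a DoReMi-style exponentiated-gradient update, under stated assumptions: the domain names, loss values, and hyperparameters below are invented for illustration, not taken from the paper.

```python
import math

def doremi_step(weights, excess_losses, lr=0.1, smoothing=1e-3):
    """One exponentiated-gradient update of DoReMi-style domain weights.

    Domains where the proxy model lags the reference (high excess loss)
    are up-weighted multiplicatively, then the weights are renormalized
    and mixed with a uniform distribution for stability. A simplified
    sketch of the idea, not the paper's implementation.
    """
    scaled = [w * math.exp(lr * l) for w, l in zip(weights, excess_losses)]
    total = sum(scaled)
    normed = [s / total for s in scaled]
    k = len(weights)
    return [(1 - smoothing) * w + smoothing / k for w in normed]

# Hypothetical domains: web, code, books; excess losses are made up.
weights = [1 / 3, 1 / 3, 1 / 3]
excess = [0.8, 0.1, 0.3]
for _ in range(50):
    weights = doremi_step(weights, excess)
print([round(w, 3) for w in weights])  # web ends up with the largest share
```

After repeated updates the weight mass concentrates on the domain with the largest excess loss, which is exactly the "train more on what the model has not yet absorbed" intuition behind these methods.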

Modality Mixture

  1. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training - paper
  2. From scarcity to efficiency: Improving clip training via visual-enriched captions - paper
  3. Valor: Vision-audio-language omni-perception pretraining model and dataset - paper
  4. AutoAD: Movie description in context - paper
  5. Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding - paper
  6. VideoChat: Chat-Centric Video Understanding - paper
  7. Mvbench: A comprehensive multi-modal video understanding benchmark - paper
  8. LLaMA-VID: An image is worth 2 tokens in large language models - paper
  9. Video-LLaVA: Learning united visual representation by alignment before projection - paper
  10. Valley: Video assistant with large language model enhanced ability - paper
  11. Video-llama: An instruction-tuned audio-visual language model for video understanding - paper
  12. Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration - paper
  13. Audio-Visual LLM for Video Understanding - paper
  14. Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models - paper

Quality Selection

  1. DataComp: In search of the next generation of multimodal datasets - paper
  2. Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters - paper
  3. CiT: Curation in Training for Effective Vision-Language Data - paper
  4. Sieve: Multimodal Dataset Pruning Using Image Captioning Models - paper
  5. Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning - paper
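
Most of the quality-selection pipelines above score each image-text pair by embedding similarity (a CLIP score) and drop pairs below a threshold. A toy sketch of that filter, with hand-made two-dimensional vectors standing in for real CLIP embeddings; the field names and threshold are illustrative assumptions.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def clip_score_filter(pairs, threshold=0.3):
    """Keep image-text pairs whose embeddings are sufficiently aligned.

    Real pipelines (LAION, DataComp baselines) compute the embeddings
    with a CLIP encoder; here they are toy vectors.
    """
    return [p for p in pairs if cosine(p["image_emb"], p["text_emb"]) >= threshold]

pairs = [
    {"id": "a", "image_emb": [1.0, 0.0], "text_emb": [0.9, 0.1]},  # well aligned
    {"id": "b", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},  # mismatched caption
]
kept = clip_score_filter(pairs)
print([p["id"] for p in kept])  # only the aligned pair survives
```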

Data-centric adaptation

Data-Centric Supervised Finetuning

  1. Unnatural instructions: Tuning language models with (almost) no human labor - paper
  2. Active Learning for Convolutional Neural Networks: A Core-Set Approach - paper
  3. Moderate Coreset: A Universal Method of Data Selection for Real-world Data-efficient Deep Learning - paper
  4. Similar: Submodular information measures based active learning in realistic scenarios - paper
  5. Practical coreset constructions for machine learning - paper
  6. Deep learning on a data diet: Finding important examples early in training - paper
  7. A new active labeling method for deep learning - paper
  8. Maybe only 0.5% data is needed: A preliminary exploration of low training data instruction tuning - paper
  9. DEFT: Data Efficient Fine-Tuning for Pre-Trained Language Models via Unsupervised Core-Set Selection - paper
  10. Beyond neural scaling laws: beating power law scaling via data pruning - paper
  11. Mods: Model-oriented data selection for instruction tuning - paper
  12. DeBERTa: Decoding-enhanced BERT with Disentangled Attention - paper
  13. Alpagasus: Training a better alpaca with fewer data - paper
  14. Rethinking the Instruction Quality: LIFT is What You Need - paper
  15. What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning - paper
  16. InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models - paper
  17. SelectLLM: Can LLMs Select Important Instructions to Annotate? - paper
  18. Improved Baselines with Visual Instruction Tuning - paper
  19. NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks - paper
  20. LESS: Selecting Influential Data for Targeted Instruction Tuning - paper
  21. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning - paper
  22. One shot learning as instruction data prospector for large language models - paper
  23. Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks - paper
  24. SelectIT: Selective Instruction Tuning for Large Language Models via Uncertainty-Aware Self-Reflection - paper
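
A recurring idea in the selection papers above (e.g. "From quantity to quality") is to score each instruction-response pair and keep only the most informative subset. A simplified sketch of Instruction-Following Difficulty (IFD) ranking, with invented loss values standing in for real model perplexities:

```python
def ifd(cond_loss, direct_loss):
    """Instruction-Following Difficulty: ratio of the answer loss given
    the instruction to the answer loss alone. A high ratio means the
    instruction barely helps the model predict the answer, marking the
    example as harder and (per the paper) more informative to train on."""
    return cond_loss / direct_loss

# Hypothetical examples with made-up losses.
examples = [
    {"id": 1, "cond_loss": 2.0, "direct_loss": 2.1},  # instruction barely helps
    {"id": 2, "cond_loss": 0.5, "direct_loss": 2.0},  # instruction helps a lot
    {"id": 3, "cond_loss": 1.5, "direct_loss": 2.0},
]
selected = sorted(
    examples, key=lambda e: ifd(e["cond_loss"], e["direct_loss"]), reverse=True
)[:2]
print([e["id"] for e in selected])  # the two hardest examples are kept
```

The same keep-top-k skeleton covers many of the other scoring signals in this list (perplexity, gradient influence, uncertainty); only the scoring function changes.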

Data-Centric Human Preference Alignment

  1. Training language models to follow instructions with human feedback - paper
  2. LLaMA-VID: An image is worth 2 tokens in large language models - paper
  3. Aligning large multimodal models with factually augmented rlhf - paper
  4. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback - paper
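
Human preference data like InstructGPT's is typically consumed through a pairwise Bradley-Terry reward-model loss over chosen/rejected responses. A minimal sketch of that loss; the reward values are illustrative, not from any trained model.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss used when fitting reward models on
    human preference comparisons: -log sigmoid(r_chosen - r_rejected).
    Small when the model already ranks the preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good = preference_loss(1.0, -1.0)   # correct ranking -> low loss
bad = preference_loss(-1.0, 1.0)    # inverted ranking -> high loss
print(round(good, 4), round(bad, 4))
```

Minimizing this loss over a preference dataset is what turns pairwise human judgments into a scalar reward signal for the alignment stage.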

Evaluation

  1. Gans trained by a two time-scale update rule converge to a local nash equilibrium - paper
  2. Assessing generative models via precision and recall - paper
  3. Unsupervised Quality Estimation for Neural Machine Translation - paper
  4. Mixture models for diverse machine translation: Tricks of the trade - paper
  5. The vendi score: A diversity evaluation metric for machine learning - paper
  6. Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning - paper
  7. Navigating text-to-image customization: From lycoris fine-tuning to model evaluation - paper
  8. TRUE: Re-evaluating factual consistency evaluation - paper
  9. Object hallucination in image captioning - paper
  10. Faithscore: Evaluating hallucinations in large vision-language models - paper
  11. Deep coral: Correlation alignment for deep domain adaptation - paper
  12. Transferability in deep learning: A survey - paper
  13. Mauve scores for generative models: Theory and practice - paper
  14. Translating Videos to Natural Language Using Deep Recurrent Neural Networks - paper
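
Several of the evaluation papers above measure hallucination by comparing objects mentioned in a generated caption against the image's annotated objects. A minimal CHAIR-style sketch (the object lists are invented; real CHAIR also includes synonym normalization and a per-sentence variant):

```python
def chair_i(caption_objects, image_objects):
    """CHAIR_i-style hallucination rate: the share of objects mentioned
    in a caption that are absent from the image's ground-truth object set."""
    mentioned = set(caption_objects)
    hallucinated = mentioned - set(image_objects)
    return len(hallucinated) / max(len(mentioned), 1)

# Hypothetical caption mentions "car", which is not in the image.
score = chair_i(["dog", "frisbee", "car"], ["dog", "frisbee", "person"])
print(round(score, 3))  # one of three mentioned objects is hallucinated
```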

Evaluation Datasets:

Dataset Modality Type Link
MMMU Image Caption and General VQA https://mmmu-benchmark.github.io
MME Image Caption and General VQA https://arxiv.org/abs/2306.13394
Nocaps Image Caption and General VQA https://github.com/nocaps-org
GQA Image Caption and General VQA https://cs.stanford.edu/people/dorarad/gqa/about.html
DVQA Image Caption and General VQA https://github.com/kushalkafle/DVQA_dataset
VSR Image Caption and General VQA https://github.com/cambridgeltl/visual-spatial-reasoning?tab=readme-ov-file
OKVQA Image Caption and General VQA https://okvqa.allenai.org/
Vizwiz Image Caption and General VQA https://vizwiz.org/
POPE Image Caption and General VQA https://github.com/RUCAIBox/POPE
TextVQA Image Text-Oriented VQA https://textvqa.org/
DocVQA Image Text-Oriented VQA https://www.docvqa.org/
ChartQA Image Text-Oriented VQA https://github.com/vis-nlp/ChartQA
AI2D Image Text-Oriented VQA https://allenai.org/data/diagrams
OCR-VQA Image Text-Oriented VQA https://ocr-vqa.github.io/
ScienceQA Image Text-Oriented VQA https://scienceqa.github.io/
MathV Image Text-Oriented VQA https://mathvista.github.io/
MMVet Image Text-Oriented VQA https://github.com/yuweihao/MM-Vet
RefCOCO, RefCOCO+, RefCOCOg Image Referring Expression Comprehension https://github.com/lichengunc/refer
GRIT Image Referring Expression Comprehension https://allenai.org/project/grit/home
TouchStone Image Instruction Following https://github.com/OFA-Sys/TouchStone
SEED-Bench Image Instruction Following https://github.com/AILab-CVC/SEED-Bench
MME Image Instruction Following https://arxiv.org/abs/2306.13394
LLaVAW Image Instruction Following https://github.com/haotian-liu/LLaVA
HM Image Other https://ai.meta.com/blog/hateful-memes-challenge-and-data-set/
MMB Image Other https://github.com/open-compass/MMBench
MSVD Video Video question answering https://paperswithcode.com/dataset/msvd
MSRVTT Video Video question answering https://paperswithcode.com/dataset/msr-vtt
TGIF-QA Video Video question answering https://paperswithcode.com/dataset/tgif-qa
ActivityNet-QA Video Video question answering https://paperswithcode.com/dataset/activitynet-qa
LSMDC Video Video question answering https://paperswithcode.com/dataset/lsmdc
MoVQA Video Video question answering https://arxiv.org/abs/2312.04817
DiDeMo Video Video captioning and Video retrieval https://paperswithcode.com/dataset/didemo
VATEX Video Video captioning and Video retrieval https://eric-xw.github.io/vatex-website/about.html
MVBench Video Other https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2
EgoSchema Video Other https://egoschema.github.io/
VideoChatGPT Video Other https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main
Charades-STA Video Other https://github.com/jiyanggao/TALL
QVHighlight Video Other https://github.com/jayleicn/moment_detr/tree/main/data
AudioCaps Audio Audio retrieval https://audiocaps.github.io/
Clotho Audio Audio retrieval https://zenodo.org/records/4743815
ClothoAQA Audio Audio question answering https://zenodo.org/records/6473207
Audio-MusicAVQA Audio Audio question answering https://gewu-lab.github.io/MUSIC-AVQA/
