LLM-Vision-Datasets

In the era of large language models (LLMs), this repository is dedicated to collecting datasets, particularly focusing on image and video data for generative AI (such as diffusion models) and image-text paired data for multimodal models. We hope that the datasets shared by the community can help everyone better learn and train new AI models. By collaborating and sharing resources, we can accelerate the progress towards achieving Artificial General Intelligence (AGI). Let's work together to create a comprehensive collection of datasets that will serve as a valuable resource for researchers, developers, and enthusiasts alike, ultimately paving the way for the realization of AGI.

Some data is more suitable for generative AI, while other data is more suitable for VLM-related tasks. Some data is suitable for both.

If there is any infringement or inappropriate content, please contact us, and we will strive to prevent such content from appearing and handle it as soon as possible.

Generative AI (Image Datasets)
- General image Datasets
- Datasets for Virtual Try-on
Generative AI (Video Datasets)
Multimodal Model Datasets
Other Datasets
tools
Awesome Datasets
How to Add a New Dataset

Generative AI (Image)

High-resolution image data sharing, not limited to real-life or anime.

General Datasets image

Name	Describe	URL
LAION	Released LAION-400M, LAION-5B, and other ultra-large image-text datasets, as well as various types of CLIP data.	https://laion.ai/projects/ https://huggingface.co/laion
Conceptual Captions Dataset	Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.	https://github.com/google-research-datasets/conceptual-captions http://ai.google.com/research/ConceptualCaptions
laion-high-resolution-chinese	A subset from Laion5B-high-resolution (a multimodal dataset), around 2.66M image-text pairs (only Chinese).	https://huggingface.co/datasets/wanng/laion-high-resolution-chinese

Datasets for Virtual Try-on

Name	Description	URL
StreetTryOn	A new in-the-wild virtual try-on dataset consisting of 12,364 and 2,089 street person images for training and validation.	GitHub
CLOTH4D	A large-scale 4D dataset with 3D human, garment and texture models, SMPL pose parameters and HD images.	GitHub
DressCode	A dataset focused on modeling the underlying 3D geometry and appearance of a person and their garments given a few or a single image.	Download Form, Paper
VITON-HD	A high-resolution virtual try-on dataset with 13,679 image pairs at 1024 x 768 resolution.	Download, Project Page
VITON	The first image-based virtual try-on dataset with 16,253 image pairs.	Download, Paper
MPV	Multi-Pose Virtual try-on dataset containing 35,687/13,524 person/clothing images.	Download, Paper
Deep Fashion3D	A large-scale 3D garment dataset with diverse garment styles and rich annotations.	Paper
DeepFashion MultiModal	A dataset for multi-modal virtual try-on containing unpaired people and garment images.	Download
Digital Wardrobe	High-quality 3D garments from real consumer photos with 2D-3D alignment annotations.	Download/Paper/Project
TailorNet Dataset	Paired images of clothed 3D humans with consistent geometry and pose for garment transfer.	Download, Project
CLOTH3D	The first 3D garment dataset with digital garments and 3D human models.	Paper
3DPeople	3D human dataset with 80 subjects wearing diverse clothing and poses.	Project
THUman Dataset	High-resolution 3D textured human body dataset with 7000+ models of 200+ subjects.	Project
Garment Dataset	3D garment dataset with digital garments fitted to real people and garment images.	Project

Generative AI (Video)

High-resolution video data sharing, not limited to real-life or anime.

General Datasets (video)

Name	Describe	URL

Multimodal Model Datasets

This section includes datasets for multimodal models,requiring at least images and corresponding captions, suitable for training multimodal large models.

Datasets of Pre-Training for Alignment

Name	Describe	URL
LAION	Released LAION-400M, LAION-5B, and other ultra-large image-text datasets, as well as various types of CLIP data.	https://laion.ai/projects/ https://huggingface.co/laion
Conceptual Captions Dataset	Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.	https://github.com/google-research-datasets/conceptual-captions http://ai.google.com/research/ConceptualCaptions
COYO-700M	COYO-700M: Large-scale Image-Text Pair Dataset for training and evaluation of machine learning-based image-text matching models.	https://github.com/kakaobrain/coyo-dataset/
ShareGPT4V	ShareGPT4V: A large image-text dataset with captions generated by GPT-4 to improve multi-modal models.	https://arxiv.org/pdf/2311.12793.pdf
AS-1B	The All-Seeing Project dataset with over 1 billion regions annotated with semantic tags, QA pairs, and captions, for panoptic visual recognition.	https://arxiv.org/pdf/2308.01907.pdf
InternVid	InternVid: A large-scale video-text dataset for multimodal understanding and generation.	https://arxiv.org/pdf/2307.06942.pdf
MS-COCO	Microsoft COCO: A large-scale object detection, segmentation, and captioning dataset.	https://arxiv.org/pdf/1405.0312.pdf
SBU Captions	SBU Captioned Photo Dataset containing 1 million images with user-associated captions harvested from Flickr.	https://proceedings.neurips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf
Conceptual Captions	Conceptual Captions: A cleaned, web-scraped image alt-text dataset for training image captioning models.	https://aclanthology.org/P18-1238.pdf
LAION-400M	LAION-400M: An open, large-scale dataset of 400 million CLIP-filtered image-text pairs.	https://arxiv.org/pdf/2111.02114.pdf https://laion.ai/projects/ https://huggingface.co/laion
VG Captions	Visual Genome dataset connecting structured image concepts to language with crowdsourced annotations.	https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf
Flickr30k	Flickr30k Entities: A dataset of 30k images with 5 captions each, annotated with bounding boxes and entity mentions.	https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf
AI-Caps	AI Challenger: A large-scale Chinese dataset with millions of images and natural language descriptions.	https://arxiv.org/pdf/1711.06475.pdf
Wukong Captions	Wukong: A 100 million scale Chinese cross-modal pre-training benchmark dataset.	https://proceedings.neurips.cc/paper_files/paper/2022/file/a90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf
GRIT	Grounded Multimodal Language Model dataset containing images aligned with text segments.	https://arxiv.org/pdf/2306.14824.pdf
Youku-mPLUG	Youku-mPLUG: A 10 million scale Chinese video-language pre-training dataset.	https://arxiv.org/pdf/2306.04362.pdf
MSR-VTT	MSR-VTT: A large video description dataset for bridging video and language.	https://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf
Webvid10M	Webvid-10M: A large-scale video-text dataset for joint video-language representation learning.	https://arxiv.org/pdf/2104.00650.pdf
WavCaps	WavCaps: A ChatGPT-assisted weakly-labeled audio captioning dataset.	https://arxiv.org/pdf/2303.17395.pdf
AISHELL-1	AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline.	https://arxiv.org/pdf/1709.05522.pdf
AISHELL-2	AISHELL-2: An industrial-scale Mandarin speech recognition dataset.	https://arxiv.org/pdf/1808.10583.pdf
VSDial-CN	Visual Semantic Dialog in Chinese: an image-audio-text dataset for studying multimodal language models.	https://arxiv.org/pdf/2305.04160.pdf

Datasets of Multimodal Instruction Tuning

Name	Describe	URL
CogVLM-SFT-311K	CogVLM-SFT-311K, a crucial aligned corpus for initializing CogVLM v1.0, was constructed by first selecting approximately 3,500 high-quality data samples from the open-source MiniGPT-4 (minigpt4-3500). This subset was then combined with Llava-Instruct-150K and machine-translated into Chinese.	https://github.com/THUDM/CogVLM/blob/main/dataset.md
ALLaVA-4V	Multimodal instruction dataset generated by GPT4V	https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V
IDK	Dehallucinative visual instruction for "I Know" hallucination	https://github.com/ncsoft/idk
CAP2QA	Image-aligned visual instruction dataset	https://github.com/ncsoft/cap2qa
M3DBench	A large-scale 3D instruction tuning dataset	https://github.com/OpenM3D/M3DBench
ViP-LLaVA-Instruct	A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data	https://huggingface.co/datasets/mucai/ViP-LLaVA-Instruct
LVIS-Instruct4V	A visual instruction dataset via self-instruction from GPT-4V	https://huggingface.co/datasets/X2FD/LVIS-Instruct4V
ComVint	A synthetic instruction dataset for complex visual reasoning	https://github.com/RUCAIBox/ComVint#comvint-data
SparklesDialogue	A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns.	https://github.com/HYPJUDY/Sparkles#sparklesdialogue
StableLLaVA	A cheap and effective approach to collect visual instruction tuning data	https://github.com/icoz69/StableLLAVA
M-HalDetect	A dataset used to train and benchmark models for hallucination detection and prevention	Coming soon
MGVLID	A high-quality instruction-tuning dataset including image-text and region-text pairs	-
BuboGPT	BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs	https://huggingface.co/datasets/magicr/BuboGPT
SVIT	SVIT: Scaling up Visual Instruction Tuning	https://huggingface.co/datasets/BAAI/SVIT
mPLUG-DocOwl	mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding	https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocLLM
PF-1M	Visual Instruction Tuning with Polite Flamingo	https://huggingface.co/datasets/chendelong/PF-1M/tree/main
ChartLlama	ChartLlama: A Multimodal LLM for Chart Understanding and Generation	https://huggingface.co/datasets/listen2you002/ChartLlama-Dataset
LLaVAR	LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding	https://llavar.github.io/#data
MotionGPT	MotionGPT: Human Motion as a Foreign Language	https://github.com/OpenMotionLab/MotionGPT
LRV-Instruction	Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning	https://github.com/FuxiaoLiu/LRV-Instruction#visual-instruction-data-lrv-instruction
Macaw-LLM	Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration	https://github.com/lyuchenyang/Macaw-LLM/tree/main/data
LAMM-Dataset	LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark	https://github.com/OpenLAMM/LAMM#lamm-dataset
Video-ChatGPT	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	https://github.com/mbzuai-oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder
MIMIC-IT	MIMIC-IT: Multi-Modal In-Context Instruction Tuning	https://github.com/Luodian/Otter/blob/main/mimic-it/README.md
M³IT	M³IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning	https://huggingface.co/datasets/MMInstruction/M3IT
LLaVA-Med	LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day	Coming soon
GPT4Tools	GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction	Link
MULTIS	ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst	Coming soon
DetGPT	DetGPT: Detect What You Need via Reasoning	Link
PMC-VQA	PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering	Coming soon
VideoChat	VideoChat: Chat-Centric Video Understanding	Link
X-LLM	X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages	Link
LMEye	LMEye: An Interactive Perception Network for Large Language Models	Link
cc-sbu-align	MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	Link
LLaVA-Instruct-150K	Visual Instruction Tuning	Link
MultiInstruct	MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning	Link

Other Datasets

Other datasets that are not easily categorized at the moment.

Name	Describe	URL

Tools

Tools that aid in data acquisition and cleaning.

Name	Describe	URL
img2dataset	Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.	https://github.com/rom1504/img2dataset

Awesome datasets

Thanks to the help from the following Awesome repositories.

How to Add a New Dataset

To contribute a new dataset to this repository, please follow these steps:

Fork this repository to your own GitHub account.
Add your dataset to the appropriate section in the README file, following the format:

| Name  | Describe  | URL     |
|-|-|-|
|DatasetsName | DatasetsDescribe | DatasetsURL1 <br> DatasetsURL2 <br> DatasetsURL3 |

For example:

| Name  | Describe  | URL     |
|-|-|-|
| LAION | Released LAION-400M, LAION-5B, and other ultra-large image-text datasets, as well as various types of CLIP data. | https://laion.ai/projects/  <br>  https://huggingface.co/laion |

If your dataset doesn't fit into any of the existing categories, create a new section for it in the README file.

Commit and push, Create a pull request.

By following these steps, you can help expand the collection of datasets available in this repository and contribute to the advancement of generative AI and multimodal visual AI research.

Contributions of more datasets related to generative AI and multimodal visual AI are welcome! If you have any suggestions or comments, please feel free to open an issue.

sanbuphy / llm-vision-datasets