OpenGVLab / OmniCorpus

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Home Page: https://arxiv.org/abs/2406.08418

OmniCorpus

[Paper] [OmniCorpus-YT] [OmniCorpus-CC-600M] [OmniCorpus-CC-200M] [OmniCorpus-CC-8M]

News 🚀🚀🚀

  • 2024/06/13: 🚀 We introduce OmniCorpus, a 10 billion-level image-text interleaved dataset. This dataset contains 8.6 billion images, 1,696 billion text tokens, and 2.2 billion documents!

Schedule

  • Release OmniCorpus-YT

  • Release OmniCorpus-CC-600M

  • Release OmniCorpus-CC-200M

  • Release OmniCorpus-CC-8M

Introduction

OmniCorpus is the largest multimodal dataset to date. It pushes the boundaries of scale and diversity by interleaving 8.6 billion images with 1,696 billion text tokens from diverse sources, significantly surpassing previous datasets. It offers several advantages over its counterparts:

  1. Larger data scale: Our dataset is 1.7 times larger in images and 12.5 times larger in texts compared to the previously largest multimodal dataset, LAION-5B, while maintaining excellent data quality.

  2. Richer data diversity: Drawing from a broader range of data sources, our dataset is more diverse than other image-text interleaved datasets. It includes bilingual multimodal data in both Chinese and English, and encompasses text-centric and vision-centric documents extracted from common websites and video platforms.

  3. More flexible format: The streaming data format of our dataset offers exceptional flexibility, allowing adaptation to various data structures, including pure text corpora, image-text pairs, and interleaved data formats.
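To illustrate the third point, here is a minimal sketch of how one interleaved document could be adapted into the three structures mentioned above. The document layout (parallel `texts` and `images` lists where exactly one of the two is set at each position), the example URLs, and the helper names are assumptions made for illustration only; please check the released data for the actual schema.

```python
# Hypothetical document layout: two parallel lists of equal length. At each
# position exactly one of the two entries is set and the other is None.
doc = {
    "texts": ["A cat sits on a mat.", None, "It looks sleepy.", None],
    "images": [None, "https://example.com/cat.jpg", None, "https://example.com/nap.jpg"],
}


def to_pure_text(doc):
    """Drop all images and join the remaining text chunks."""
    return "\n".join(t for t in doc["texts"] if t is not None)


def to_image_text_pairs(doc):
    """Pair each image with the nearest preceding text chunk (a simple heuristic)."""
    pairs = []
    last_text = None
    for text, image in zip(doc["texts"], doc["images"]):
        if text is not None:
            last_text = text
        elif image is not None and last_text is not None:
            pairs.append((image, last_text))
    return pairs


def to_interleaved(doc, image_token="<image>"):
    """Keep the original order and mark each image position with a placeholder token."""
    return [image_token if t is None else t for t in doc["texts"]]
```

Pairing an image with the preceding text is only one possible choice; pairing with the following text or with a similarity-ranked chunk would fit the same layout.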

image

Some examples:

image image image

Data Pipeline

Our data pipeline consists of five key stages: main body extraction, preliminary text filtering, document deduplication, image downloading & filtering, and detailed text filtering. Each stage reduces the dataset further so that only high-quality data is retained. Please refer to our paper for more details about the data pipeline.
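The order of the five stages can be sketched as a chain of simple functions. This is a toy illustration, not our actual implementation: every function name, threshold, and heuristic below (whitespace stripping, a minimum length, exact-hash deduplication, a minimum image side, a banned-phrase list) is a placeholder assumption standing in for the real logic described in the paper.

```python
import hashlib


def extract_main_body(doc):
    # Placeholder: the real stage parses web pages; here we only strip whitespace.
    return {**doc, "text": doc["text"].strip()}


def preliminary_text_filter(docs, min_chars=20):
    # Placeholder rule: drop documents with very little text.
    return [d for d in docs if len(d["text"]) >= min_chars]


def deduplicate(docs):
    # Placeholder: exact-match deduplication on a hash of the text.
    seen, unique = set(), []
    for d in docs:
        digest = hashlib.sha256(d["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def filter_images(docs, min_side=64):
    # Placeholder for downloading and filtering: keep images that are large enough.
    return [
        {**d, "images": [im for im in d["images"] if min(im["width"], im["height"]) >= min_side]}
        for d in docs
    ]


def detailed_text_filter(docs, banned=("lorem ipsum",)):
    # Placeholder rule: drop documents containing known filler phrases.
    return [d for d in docs if not any(b in d["text"].lower() for b in banned)]


def run_pipeline(raw_docs):
    docs = [extract_main_body(d) for d in raw_docs]
    docs = preliminary_text_filter(docs)
    docs = deduplicate(docs)
    docs = filter_images(docs)
    return detailed_text_filter(docs)
```

Each stage takes and returns a list of documents, so every step can only shrink the set, which mirrors the progressive reduction described above.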

image

Experimental Results

We conduct a series of experiments to evaluate the effectiveness of OmniCorpus. As shown in the table below, a model trained on our dataset demonstrates superior performance on academic captioning and VQA benchmarks. Please refer to our paper for more experimental results.

image
