kyegomez / Kosmos2.5

My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model"

Home Page: https://discord.gg/qUtxnK2NMf

Kosmos2.5

My implementation of Kosmos2.5, the model from Microsoft Research described in the paper "KOSMOS-2.5: A Multimodal Literate Model".

Paper Link: https://arxiv.org/abs/2309.11419

Appreciation

  • Lucidrains
  • Agorians

Install

pip install kosmos2-torch

Usage

import torch
from kosmos.model import Kosmos

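# Dummy inputs: one 256x256 RGB image and one sequence of 1024 token IDs
img = torch.randn(1, 3, 256, 256)
text = torch.randint(0, 20000, (1, 1024))

# Build the model with its default configuration and run a forward pass
model = Kosmos()
output = model(img, text)
print(output)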

Dataset Strategy

The table below summarizes the datasets used in the paper "KOSMOS-2.5: A Multimodal Literate Model", with modality, size, domain, and source for each:

| Dataset | Modality | # Samples | Domain | Source |
|---|---|---|---|---|
| IIT-CDIP | Text + Layout | 27.6M pages | Scanned documents | Link |
| arXiv papers | Text + Layout | 20.9M pages | Research papers | Link |
| PowerPoint slides | Text + Layout | 6.2M pages | Presentation slides | Web crawl |
| General PDF | Text + Layout | 155.2M pages | Diverse PDF files | Web crawl |
| Web screenshots | Text + Layout | 100M pages | Webpage screenshots | Link |
| README | Text + Markdown | 2.9M files | GitHub README files | Link |
| DOCX | Text + Markdown | 1.1M pages | Word documents | Web crawl |
| LaTeX | Text + Markdown | 3.7M pages | Research papers | Link |
| HTML | Text + Markdown | 6.3M pages | Webpages | Link |
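
To get a feel for how the two modalities compare in size, the figures in the table above can be added up with a few lines of Python. This is only a rough sketch: the `DATASETS` list is transcribed by hand from the table, and `totals_by_modality` and `proportional_weights` are hypothetical helpers, not part of this package. The size-proportional weighting in particular is an assumption made for illustration and is not claimed to be the sampling scheme used in the paper.

```python
from collections import defaultdict

# (dataset, modality, size in millions, unit), transcribed from the table above.
# Note that README is counted in files while every other row is counted in pages.
DATASETS = [
    ("IIT-CDIP", "Text + Layout", 27.6, "pages"),
    ("arXiv papers", "Text + Layout", 20.9, "pages"),
    ("PowerPoint slides", "Text + Layout", 6.2, "pages"),
    ("General PDF", "Text + Layout", 155.2, "pages"),
    ("Web screenshots", "Text + Layout", 100.0, "pages"),
    ("README", "Text + Markdown", 2.9, "files"),
    ("DOCX", "Text + Markdown", 1.1, "pages"),
    ("LaTeX", "Text + Markdown", 3.7, "pages"),
    ("HTML", "Text + Markdown", 6.3, "pages"),
]


def totals_by_modality(rows):
    """Sum the sample counts (in millions) for each modality."""
    totals = defaultdict(float)
    for _name, modality, millions, _unit in rows:
        totals[modality] += millions
    # Round to one decimal place to hide floating-point noise.
    return {modality: round(total, 1) for modality, total in totals.items()}


def proportional_weights(rows):
    """Hypothetical size-proportional sampling weight per dataset (sums to 1)."""
    grand_total = sum(millions for _n, _m, millions, _u in rows)
    return {name: millions / grand_total for name, _m, millions, _u in rows}


if __name__ == "__main__":
    for modality, total in totals_by_modality(DATASETS).items():
        print(f"{modality}: {total}M samples")
    for name, weight in proportional_weights(DATASETS).items():
        print(f"{name}: {weight:.3f}")
```

The Text + Layout rows add up to 309.9M samples and the Text + Markdown rows to 14.0M, so under a purely size-proportional scheme the markdown data would be seen far less often than the layout data.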

License

MIT

Citations

@misc{2309.11419,
Author = {Tengchao Lv and Yupan Huang and Jingye Chen and Lei Cui and Shuming Ma and Yaoyao Chang and Shaohan Huang and Wenhui Wang and Li Dong and Weiyao Luo and Shaoxiang Wu and Guoxin Wang and Cha Zhang and Furu Wei},
Title = {Kosmos-2.5: A Multimodal Literate Model},
Year = {2023},
Eprint = {arXiv:2309.11419},
}
