Object Recognition as Next Token Prediction

arXiv | Colab | Documentation | Hugging Face

Top 30 predictions with probabilities from our model on the image of "The Legend of Zelda: Tears of the Kingdom" ¹.

Introduction

This is the official PyTorch implementation of the paper Object Recognition as Next Token Prediction, accepted at CVPR 2024 (Highlight).

@inproceedings{nxtp,
  title     = {{Object Recognition as Next Token Prediction}},
  author    = {Kaiyu Yue and Bor-Chun Chen and Jonas Geiping and Hengduo Li and Tom Goldstein and Ser-Nam Lim},
  booktitle = {Computer Vision and Pattern Recognition Conference (CVPR)},
  year      = {2024}
}

Updates

May 26, 2024

add ImageNet experiments: see src/imagenet
visualize attention maps in decoder layers during inference: see examples

Mar 17, 2024

release the best 1.78B model trained on G70M
export onnx models: docs/onnx-export

Mar 03, 2024

add examples with top-20 predictions to this readme
add CLIP ViT-L/14 as the textual embedding model in the evaluation metric (Table A.8 of the paper)

Method

This project addresses a fundamental problem in computer vision, object recognition: translating an image into object labels.

Linear models (such as ResNet) and contrastive models (such as CLIP) require predefined labels before inference, limiting their flexibility in real-world applications.

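We extend W, the label embeddings that such models score image features against, to cover the entire textual space by using the token embeddings of a language model, such as LLaMA's 32K token embeddings. Our model predicts labels in a real-open manner through auto-regressive processing.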

Additionally, our one-shot sampling technique enables efficient large-scale discriminative predictions, such as the top-100 labels.
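As a rough illustration of ranking labels by probability, the sketch below applies a softmax to a set of scores and keeps the k most probable entries. It is not the repository's sampler: the vocabulary, the logits, and the function names are hypothetical, every label is assumed to be a single token, and everything specific to the paper's one-shot sampling is left out.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, vocab, k):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(vocab[i], probs[i]) for i in ranked[:k]]


# Toy inputs (hypothetical): four candidate labels and their scores.
toy_vocab = ["dog", "cat", "hat", "sky"]
toy_logits = [2.0, 0.5, 1.0, -1.0]

top2 = top_k_labels(toy_logits, toy_vocab, 2)
for label, prob in top2:
    print(f"| prob: {prob:.5f} - {label}")
```

Because the probabilities are normalized over the whole vocabulary, the values of the selected top-k labels generally sum to less than 1, which matches the shape of the prediction lists shown in this README.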

The released models have 1.78B parameters. A model truncated to 0.77B parameters, which keeps only one transformer block in the decoder, still achieves competitive performance (Table 3 in the paper).

Examples

Image w/ Top-20 Predictions	Attention Map	Image w/ Top-20 Predictions	Attention Map
click to review ¹ `prob: 0.13949 - legend` `prob: 0.12399 - sky` `prob: 0.04723 - cloud` `prob: 0.04642 - game` `prob: 0.04500 - screenshot` `prob: 0.03189 - top` `prob: 0.03024 - mountain` `prob: 0.02262 - cliff` `prob: 0.01790 - world` `prob: 0.01483 - wii` `prob: 0.01440 - video` `prob: 0.01310 - breath` `prob: 0.01087 - zeo` `prob: 0.00982 - zelda` `prob: 0.00959 - character` `prob: 0.00865 - rock` `prob: 0.00816 - link` `prob: 0.00788 - island` `prob: 0.00624 - adventure` `prob: 0.00591 - woman`	attention map info `decoder: layer 0: head 25`	click to review ² `prob: 0.23237 - rocket` `prob: 0.10435 - launch` `prob: 0.06144 - soyuz` `prob: 0.04314 - space` `prob: 0.03541 - smoke` `prob: 0.03249 - sky` `prob: 0.01971 - shuttle` `prob: 0.01566 - tower` `prob: 0.01551 - paris` `prob: 0.01229 - cloud` `prob: 0.01067 - pad` `prob: 0.01050 - cape` `prob: 0.00983 - falcon` `prob: 0.00956 - photo` `prob: 0.00834 - lift` `prob: 0.00814 - air` `prob: 0.00779 - mission` `prob: 0.00710 - station` `prob: 0.00688 - july` `prob: 0.00647 - satellite`	attention map info `decoder: layer 0: head 0`
click to review ³ `prob: 0.30731 - dog` `prob: 0.13647 - sweater` `prob: 0.11870 - hat` `prob: 0.06812 - scarf` `prob: 0.04131 - brick` `prob: 0.03114 - wall` `prob: 0.01796 - shirt` `prob: 0.01471 - cute` `prob: 0.01156 - cap` `prob: 0.00982 - neck` `prob: 0.00929 - top` `prob: 0.00797 - head` `prob: 0.00777 - beanie` `prob: 0.00658 - man` `prob: 0.00588 - sits` `prob: 0.00582 - coat` `prob: 0.00524 - jacket` `prob: 0.00476 - collar` `prob: 0.00460 - face` `prob: 0.00119 - bone`	attention map info `decoder: layer 0: head 25`	click to review ⁴ `prob: 0.14861 - coffee` `prob: 0.10409 - shop` `prob: 0.08065 - counter` `prob: 0.04603 - bar` `prob: 0.04055 - restaurant` `prob: 0.03691 - inside` `prob: 0.03468 - area` `prob: 0.02638 - store` `prob: 0.02219 - table` `prob: 0.01930 - interior` `prob: 0.01347 - lot` `prob: 0.01156 - food` `prob: 0.01058 - customer` `prob: 0.01001 - room` `prob: 0.00923 - starbucks` `prob: 0.00853 - bakery` `prob: 0.00738 - view` `prob: 0.00738 - floor` `prob: 0.00733 - cafe` `prob: 0.00633 - shelf`	attention map info `decoder: layer 0: head 8`
click to review ³ `prob: 0.47652 - monster` `prob: 0.09664 - cartoon` `prob: 0.03812 - character` `prob: 0.03724 - group` `prob: 0.03312 - creature` `prob: 0.02111 - cute` `prob: 0.01929 - vector` `prob: 0.01481 - animal` `prob: 0.00955 - art` `prob: 0.00924 - alien` `prob: 0.00837 - pose` `prob: 0.00604 - bubble` `prob: 0.00553 - eye` `prob: 0.00533 - color` `prob: 0.00528 - hand` `prob: 0.00477 - design` `prob: 0.00474 - wallpaper` `prob: 0.00462 - child` `prob: 0.00445 - people` `prob: 0.00445 - family`	attention map info `decoder: layer 2: head 7`	click to review ³ `prob: 0.54375 - cloud` `prob: 0.09932 - word` `prob: 0.07571 - sky` `prob: 0.03153 - letter` `prob: 0.01862 - sora` `prob: 0.01380 - logo` `prob: 0.00995 - text` `prob: 0.00715 - top` `prob: 0.00715 - blue` `prob: 0.00677 - title` `prob: 0.00608 - photo` `prob: 0.00427 - picture` `prob: 0.00288 - sonora` `prob: 0.00269 - middle` `prob: 0.00257 - storm` `prob: 0.00202 - cloudscape` `prob: 0.00190 - sun` `prob: 0.00189 - art` `prob: 0.00156 - soar` `prob: 0.00041 - icy`	attention map info `decoder: layer 1: head 13`
click to review ³ `prob: 0.15317 - building` `prob: 0.13619 - wave` `prob: 0.04782 - room` `prob: 0.03498 - middle` `prob: 0.03188 - hall` `prob: 0.02367 - people` `prob: 0.02135 - ocean` `prob: 0.02087 - floor` `prob: 0.01867 - world` `prob: 0.01773 - inside` `prob: 0.01548 - man` `prob: 0.01380 - water` `prob: 0.01205 - view` `prob: 0.01200 - surfer` `prob: 0.01109 - photo` `prob: 0.00798 - hotel` `prob: 0.00734 - city` `prob: 0.00662 - pool` `prob: 0.00566 - art` `prob: 0.00319 - mural`	attention map info `decoder: layer 1: head 16`	click to review ³ `prob: 0.25673 - bird` `prob: 0.21676 - feather` `prob: 0.18550 - peacock` `prob: 0.04251 - head` `prob: 0.03240 - blue` `prob: 0.02507 - pigeon` `prob: 0.02183 - tail` `prob: 0.01339 - hair` `prob: 0.01187 - top` `prob: 0.00677 - face` `prob: 0.00631 - camera` `prob: 0.00463 - beak` `prob: 0.00451 - eye` `prob: 0.00419 - fence` `prob: 0.00370 - sits` `prob: 0.00333 - perch` `prob: 0.00330 - photo` `prob: 0.00318 - wall` `prob: 0.00269 - animal` `prob: 0.00106 - jay`	attention map info `decoder: layer 1: head 25`
click to review ⁵ `prob: 0.07247 - tablet` `prob: 0.06770 - coffee` `prob: 0.06562 - window` `prob: 0.05829 - controller` `prob: 0.05668 - game` `prob: 0.04802 - switch` `prob: 0.04043 - wii` `prob: 0.03798 - console` `prob: 0.03563 - cup` `prob: 0.02570 - top` `prob: 0.02067 - mug` `prob: 0.01808 - screen` `prob: 0.01344 - video` `prob: 0.01105 - star` `prob: 0.01092 - nintendo` `prob: 0.01055 - computer` `prob: 0.00819 - mario` `prob: 0.00815 - remote` `prob: 0.00736 - control` `prob: 0.00393 - sill`	attention map info `decoder: layer 0: head 12`	click to review ⁶ `prob: 0.36523 - airplane` `prob: 0.09151 - cargo` `prob: 0.07531 - plane` `prob: 0.05538 - ship` `prob: 0.04223 - container` `prob: 0.03105 - water` `prob: 0.03040 - view` `prob: 0.02277 - dock` `prob: 0.01685 - port` `prob: 0.01434 - sky` `prob: 0.01328 - shipping` `prob: 0.00788 - middle` `prob: 0.00751 - body` `prob: 0.00717 - photo` `prob: 0.00715 - jet` `prob: 0.00714 - city` `prob: 0.00621 - ocean` `prob: 0.00615 - freight` `prob: 0.00609 - boat` `prob: 0.00320 - transportation`	attention map info `decoder: layer 2: head 14`
click to review ⁶ `prob: 0.15236 - candy` `prob: 0.12271 - sweater` `prob: 0.11457 - glass` `prob: 0.10593 - dog` `prob: 0.08311 - chair` `prob: 0.07111 - cane` `prob: 0.04701 - sunglass` `prob: 0.04589 - christmas` `prob: 0.02361 - costume` `prob: 0.02085 - wearing` `prob: 0.01870 - hat` `prob: 0.00734 - head` `prob: 0.00636 - top` `prob: 0.00577 - outfit` `prob: 0.00520 - chocolate` `prob: 0.00437 - holi` `prob: 0.00362 - suit` `prob: 0.00344 - shirt` `prob: 0.00322 - strawberry` `prob: 0.00211 - wig`	attention map info `decoder: layer 1: head 16`	click to review ⁶ `prob: 0.19960 - living` `prob: 0.16291 - room` `prob: 0.11353 - sofa` `prob: 0.06036 - couch` `prob: 0.04741 - rug` `prob: 0.04704 - coffee` `prob: 0.03795 - dog` `prob: 0.03659 - wall` `prob: 0.02980 - table` `prob: 0.01611 - floor` `prob: 0.01594 - grey` `prob: 0.01472 - wood` `prob: 0.01353 - furniture` `prob: 0.01314 - plant` `prob: 0.01274 - fireplace` `prob: 0.01161 - pillow` `prob: 0.00941 - chair` `prob: 0.00512 - home` `prob: 0.00434 - blanket` `prob: 0.00351 - art`	attention map info `decoder: layer 1: head 16`

Models

The following table shows the reproduced results of recall (R column in Table 1 of the paper) on the validation splits with top-10 predictions.

# params	training group	checkpoint	md5	CC3M	COCO	OpenImages
1.78B	G3M	Hugging Face	`b2a69b`	0.740	0.703	0.616
1.78B	G70M	Hugging Face	`e177c7`	0.721	0.765	0.662

Downloading

The checkpoints can be downloaded from the links in the table above. For downloading from Hugging Face, one option is to use git-lfs:

# install git lfs
git lfs install

# download the checkpoint in terminal
git clone https://huggingface.co/kaiyuyue/nxtp

Alternatively, the checkpoint can be downloaded from the model page in a web browser.
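After downloading, it can be worth checking the file against the md5 column of the table. The sketch below assumes that the table lists the first six hexadecimal characters of the file's MD5 digest, which this README does not state explicitly. The function names and the checkpoint path are placeholders.

```python
import hashlib


def md5_prefix(path, length=6, chunk_size=1 << 20):
    """Return the first `length` hex characters of a file's MD5 digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that multi-gigabyte checkpoints fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def matches_table(path, expected_prefix):
    """Compare a file's MD5 prefix with the value listed in the table."""
    return md5_prefix(path, len(expected_prefix)) == expected_prefix.lower()
```

For example, `matches_table("path/to/model/checkpoint", "b2a69b")` would check a downloaded G3M checkpoint under the assumption above.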

Inference

The image assets/starbux.jpg is provided for a quick test. First, please follow the instructions in Dependencies to prepare the environment.

To run inference on an image, please run

python src/infer.py \
  --ckpt-path path/to/model/checkpoint \
  --img-path assets/starbux.jpg \
  --num-labels 20

The output from the model trained on G3M is

top-20 predictions:
| prob: 0.05742 - coffee
| prob: 0.05525 - restaurant
| prob: 0.04402 - shop
| prob: 0.02528 - room
| prob: 0.02468 - store
| prob: 0.02381 - interior
| prob: 0.01732 - area
| prob: 0.01640 - building
| prob: 0.01616 - food
| prob: 0.01408 - bar
| prob: 0.01247 - customer
| prob: 0.01134 - view
| prob: 0.01059 - floor
| prob: 0.01045 - table
| prob: 0.00933 - kitchen
| prob: 0.00926 - home
| prob: 0.00872 - look
| prob: 0.00841 - people
| prob: 0.00693 - cup
| prob: 0.00665 - counter

The output from the model trained on G70M is

top-20 predictions:
| prob: 0.15203 - coffee
| prob: 0.09728 - shop
| prob: 0.09182 - counter
| prob: 0.03848 - interior
| prob: 0.03389 - bar
| prob: 0.03215 - restaurant
| prob: 0.02440 - table
| prob: 0.02245 - store
| prob: 0.01950 - area
| prob: 0.01905 - inside
| prob: 0.01590 - starbucks
| prob: 0.01313 - cafe
| prob: 0.01220 - chair
| prob: 0.01172 - floor
| prob: 0.01020 - cup
| prob: 0.00879 - drink
| prob: 0.00794 - room
| prob: 0.00746 - customer
| prob: 0.00635 - wood
| prob: 0.00345 - bakery
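To post-process these predictions in a script, the printed lines can be parsed back into (label, probability) pairs. The following is a minimal sketch that assumes the output keeps exactly the `| prob: <value> - <label>` layout shown above. The function name `parse_predictions` is hypothetical and not part of this repository.

```python
def parse_predictions(text):
    """Parse lines like '| prob: 0.15203 - coffee' into (label, prob) pairs."""
    prefix = "| prob:"
    preds = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # skip the 'top-20 predictions:' header and blank lines
        prob_part, label = line[len(prefix):].split(" - ", 1)
        preds.append((label.strip(), float(prob_part)))
    return preds


# The first three lines of the G70M output shown above.
sample = """top-20 predictions:
| prob: 0.15203 - coffee
| prob: 0.09728 - shop
| prob: 0.09182 - counter
"""

parsed = parse_predictions(sample)
for label, prob in parsed:
    print(label, prob)
```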

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

¹ Image credit: ゼルダの伝説 The Legend of Zelda: Tears of the Kingdom.
² Image credit: Space-X.
³ Image credit: OpenAI Sora.
⁴ Image credit: Photo taken by the author at a Starbucks store.
⁵ Image credit: Super Mario Bros Wonder.
⁶ Image credit: Demo in Segment Anything | Meta AI.
