

OpenAI-CLIP

Multi-Modal

CLIP

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3. CLIP matches the performance of the original ResNet50 on ImageNet "zero-shot", without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.
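
As a concrete illustration, here is a minimal zero-shot classification sketch using the official `openai/clip` package: the model scores a set of natural-language candidate labels against an image and the most relevant one wins. The image path `example.jpg` and the candidate labels are placeholders for this example, not files from this repository.

```python
# Minimal zero-shot classification sketch with the official CLIP package
# (pip install ftfy regex tqdm git+https://github.com/openai/CLIP.git).
# "example.jpg" and the candidate labels below are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each candidate text snippet
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)
```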

How CLIP Works

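At a high level, CLIP trains an image encoder and a text encoder jointly so that matching (image, text) pairs receive high cosine similarity in a shared embedding space, while mismatched pairs receive low similarity. The sketch below mirrors the symmetric contrastive objective described in the CLIP paper; the encoder modules, batch tensors, and `logit_scale` (a learned temperature) are placeholders, not this repository's actual training code.

```python
# Sketch of CLIP's symmetric contrastive objective (after the pseudocode in the paper).
# `image_encoder`, `text_encoder`, `images`, `texts`, and `logit_scale` are placeholders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_encoder, text_encoder, images, texts, logit_scale):
    # Embed both modalities into the shared space and L2-normalize
    image_features = F.normalize(image_encoder(images), dim=-1)  # [N, d]
    text_features = F.normalize(text_encoder(texts), dim=-1)     # [N, d]

    # Scaled pairwise cosine similarities: entry (i, j) scores image i against text j
    logits = logit_scale * image_features @ text_features.t()    # [N, N]

    # The i-th image matches the i-th text, so the correct targets lie on the diagonal
    targets = torch.arange(images.size(0), device=logits.device)

    # Symmetric cross-entropy over rows (image -> text) and columns (text -> image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```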

Multi-modal ML with OpenAI's CLIP

Language models (LMs) cannot rely on language alone. That is the idea behind the "Experience Grounds Language" paper, which proposes a framework for measuring LMs' current and future progress. A key idea is that, beyond a certain threshold, LMs need other forms of data, such as visual input.


The next step beyond well-known language models such as BERT, GPT-3, and T5 is "World Scope 3". In World Scope 3, we move from large text-only datasets to large multi-modal datasets, that is, datasets containing information from multiple forms of media, such as both images and text.
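
Once images and text live in the same embedding space, multi-modal tasks such as text-to-image retrieval reduce to a nearest-neighbour search over CLIP embeddings. The snippet below is a small sketch of that idea using the same `clip` package; the image file names and the query string are placeholders.

```python
# Sketch of text-to-image retrieval with CLIP embeddings.
# The image paths and the query string are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image_paths = ["photo1.jpg", "photo2.jpg", "photo3.jpg"]  # placeholder files
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
query = clip.tokenize(["a dog playing in the snow"]).to(device)  # placeholder query

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)

# Normalize and rank images by cosine similarity to the text query
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (text_features @ image_features.t()).squeeze(0)

best = similarity.argmax().item()
print("Best match:", image_paths[best])
```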
