

OpenAI-CLIP

Multi-Modal

CLIP

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3. CLIP matches the performance of the original ResNet50 on ImageNet "zero-shot", without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.
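
As a concrete illustration, here is a minimal zero-shot classification sketch using the official `openai/clip` package: the model scores a set of natural-language candidate labels against an image and the most relevant one wins. The image path `example.jpg` and the candidate labels are placeholders for this example, not files from this repository.

```python
# Minimal zero-shot classification sketch with the official CLIP package
# (pip install ftfy regex tqdm git+https://github.com/openai/CLIP.git).
# "example.jpg" and the candidate labels below are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each candidate text snippet
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)
```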

How CLIP Works

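At a high level, CLIP trains an image encoder and a text encoder jointly so that matching (image, text) pairs receive high cosine similarity in a shared embedding space, while mismatched pairs receive low similarity. The sketch below mirrors the symmetric contrastive objective described in the CLIP paper; the encoder modules, batch tensors, and `logit_scale` (a learned temperature) are placeholders, not this repository's actual training code.

```python
# Sketch of CLIP's symmetric contrastive objective (after the pseudocode in the paper).
# `image_encoder`, `text_encoder`, `images`, `texts`, and `logit_scale` are placeholders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_encoder, text_encoder, images, texts, logit_scale):
    # Embed both modalities into the shared space and L2-normalize
    image_features = F.normalize(image_encoder(images), dim=-1)  # [N, d]
    text_features = F.normalize(text_encoder(texts), dim=-1)     # [N, d]

    # Scaled pairwise cosine similarities: entry (i, j) scores image i against text j
    logits = logit_scale * image_features @ text_features.t()    # [N, N]

    # The i-th image matches the i-th text, so the correct targets lie on the diagonal
    targets = torch.arange(images.size(0), device=logits.device)

    # Symmetric cross-entropy over rows (image -> text) and columns (text -> image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```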

Multi-modal ML with OpenAI's CLIP

Language models (LMs) cannot rely on language alone. That is the idea behind the "Experience Grounds Language" paper, which proposes a framework for measuring LMs' current and future progress. A key idea is that, beyond a certain threshold, LMs need other forms of data, such as visual input.


The next step beyond well-known language models such as BERT, GPT-3, and T5 is "World Scope 3". In World Scope 3, we move from large text-only datasets to large multi-modal datasets, that is, datasets containing information from multiple forms of media, such as both images and text.
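
Once images and text live in the same embedding space, multi-modal tasks such as text-to-image retrieval reduce to a nearest-neighbour search over CLIP embeddings. The snippet below is a small sketch of that idea using the same `clip` package; the image file names and the query string are placeholders.

```python
# Sketch of text-to-image retrieval with CLIP embeddings.
# The image paths and the query string are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image_paths = ["photo1.jpg", "photo2.jpg", "photo3.jpg"]  # placeholder files
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
query = clip.tokenize(["a dog playing in the snow"]).to(device)  # placeholder query

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)

# Normalize and rank images by cosine similarity to the text query
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (text_features @ image_features.t()).squeeze(0)

best = similarity.argmax().item()
print("Best match:", image_paths[best])
```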
