Image Captioning with Vision Transformer and LLMs

This repository contains the implementation of an image captioning model that integrates Vision Transformer (ViT) and GPT-J to generate descriptive captions for images. The model is built using the Hugging Face Transformers library and is trained on the COCO dataset.

Project Overview

The project explores how well a vision model and a language model can be combined to generate accurate, contextually relevant image descriptions. The VisionEncoderDecoder framework from Hugging Face Transformers pairs ViT as the encoder with GPT-J as the decoder.

Getting Started

Prerequisites

Python 3.8 or above
PyTorch 1.8 or above
Transformers 4.0 or above
Datasets
Pillow (PIL)
Pandas
NumPy
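
The list above can be checked against a local environment before training. The following is a minimal sketch, not part of the repository: the helper names (parse_version, check_environment) are hypothetical, and the PyPI distribution names (torch, transformers, datasets, Pillow, pandas, numpy) are assumed to match the packages listed here.

```python
import sys

# Minimum versions from the prerequisites list; distribution names are assumptions.
MINIMUMS = {"torch": "1.8", "transformers": "4.0"}
# Packages listed without a minimum version.
UNPINNED = ["datasets", "Pillow", "pandas", "numpy"]


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:
            # Stop at a suffix such as "0+cu118" or "0rc1".
            break
    return tuple(parts)


def check_environment():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    if sys.version_info < (3, 8):
        return ["Python 3.8 or above is required"]

    # importlib.metadata is in the standard library from Python 3.8 onward.
    from importlib import metadata

    problems = []
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need {minimum} or above)")
            continue
        if parse_version(installed) < parse_version(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")

    for name in UNPINNED:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the script prints one line per missing or outdated package and prints nothing when every prerequisite is satisfied. The version comparison only looks at the leading numeric components, which is enough for the coarse minimums given here.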

AliAlfatemi / CV-for-SC
