AliAlfatemi / CV-for-SC

Image Captioning with Vision Transformer and LLMs

This repository contains the implementation of an image captioning model that integrates Vision Transformer (ViT) and GPT-J to generate descriptive captions for images. The model is built using the Hugging Face Transformers library and is trained on the COCO dataset.

Project Overview

The project explores how well a vision model and a language model can be combined to generate accurate, contextually relevant image descriptions. It uses the VisionEncoderDecoder framework from Transformers, with ViT as the image encoder and GPT-J as the text decoder.

Getting Started

Prerequisites

  • Python 3.8 or above
  • PyTorch 1.8 or above
  • Transformers 4.0 or above
  • Datasets (Hugging Face)
  • Pillow (PIL)
  • Pandas
  • NumPy
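
The prerequisites above can be checked before running anything else. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `check_environment` are hypothetical, and the PyPI distribution names (`torch` for PyTorch, `Pillow` for PIL) are assumptions about which packages the list refers to. It uses only the standard library.

```python
import sys
from importlib import metadata

# Minimum versions taken from the prerequisites list; None means no minimum is stated.
# The distribution names (e.g. "torch", "Pillow") are assumptions, not from the repository.
REQUIREMENTS = {
    "torch": (1, 8),
    "transformers": (4, 0),
    "datasets": None,
    "Pillow": None,
    "pandas": None,
    "numpy": None,
}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch not in "0123456789":
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment(requirements=REQUIREMENTS, min_python=(3, 8)):
    """Return a list of readable problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append("Python %d.%d or above is required" % min_python)
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        # Compare only the numeric prefix, so local tags such as "+cu117" are ignored.
        if minimum is not None and parse_version(installed) < minimum:
            wanted = ".".join(str(n) for n in minimum)
            problems.append(f"{name} {installed} is older than {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Run as a script, it prints one line per missing or outdated package and prints nothing when every listed prerequisite is satisfied.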
