Repositories under the captioning topic:
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training diffusion models.
Code for "Aligning Linguistic Words and Visual Semantic Units for Image Captioning", ACM MM 2019
CapDec: SOTA Zero-Shot Image Captioning Using CLIP and GPT2, EMNLP 2022 (Findings)
Audio Captioning datasets for PyTorch.
A tennis dataset and models for event detection & commentary generation
Fully-Convolutional Point Networks for Large-Scale Point Clouds
Python code for handling the Clotho dataset.
A base TensorFlow project for medical report generation
What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment [CVPR 2019]
[CVPR 2023 & IJCV 2025] Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation
Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch.
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
A PyTorch implementation of the Attention on Attention module (both self and guided variants) for Visual Question Answering
Using LLMs and pre-trained caption models for super-human performance on image captioning.
Audio captioning baseline system for DCASE 2020 challenge.
[CVPR 2022] X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Toolkit for supporting the EBU-TT Live specification
Some papers about *diverse* image (and a few video) captioning
My notes on some Deep Learning papers
A curated list of zero-shot captioning papers
S2VT (seq2seq) video captioning with Bahdanau and Luong attention, implemented in TensorFlow
[ICCV 2023] With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning.
Tools for the evaluation of audio captioning.
Official Python implementation of R3-Transformer
Automated Reddit scraper and video creator
ide-cap-chan is a utility for batch image captioning with natural language using various VL models
[NLPCC'23] ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple Oracles (PyTorch implementation)
Indonesian Image Captioning using Attention-based Semantic Compositional Networks
Modifying LAVIS' BLIP2 Q-former with models pretrained on Japanese datasets.
Single-stream Extractor Network with Contrastive Pre-training for Remote Sensing Change Captioning