There are 19 repositories under the visual-question-answering topic.
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome
Implementation of 🦩 Flamingo, DeepMind's state-of-the-art few-shot visual question answering attention network, in PyTorch
PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"
A collection of resources on applications of multi-modal learning in medical imaging.
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
Strong baseline for visual question answering
Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey
PyTorch implementation of the winning entry from the VQA Challenge Workshop at CVPR'17
[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
[AAAI 2024] NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario.
This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"
Document Visual Question Answering
A collection of computer vision projects & tools.
A PyTorch implementation of "A simple neural network module for relational reasoning", evaluated on the CLEVR dataset
CNN+LSTM, attention-based, and MUTAN-based models for Visual Question Answering
Bottom-up feature extractor implemented in PyTorch.
[Paper][ISWC 2021] Zero-shot Visual Question Answering using Knowledge Graph
PyTorch VQA implementation that achieved top performances in the (ECCV18) VizWiz Grand Challenge: Answering Visual Questions from Blind People
Large Language Models are Temporal and Causal Reasoners for Video Question Answering (EMNLP 2023)
Code for the COLING 2018 paper "Learning Semantic Sentence Embeddings using Pair-wise Discriminator"
PyTorch implementation of FiLM: Visual Reasoning with a General Conditioning Layer