The following repositories are listed under the blip2 topic.
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
[ICRA 2024] Chat with NeRF enables users to interact with a NeRF model by typing in natural language.
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
Official code base for NeuroClips
Automate fashion image captioning using BLIP-2. Automatically generating descriptions of clothes on shopping websites can help customers without fashion knowledge better understand an item's features (attributes, style, functionality, etc.) and increase online sales by enticing more customers.
A true multimodal LLaMA derivative -- on Discord!
Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching"
[ACM MM 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives
The Multimodal Model for Vietnamese Visual Question Answering (ViVQA)
CLIP Interrogator, fully in Hugging Face Transformers 🤗, with LongCLIP & CLIP's own words and/or *your* own words!
Modifying LAVIS' BLIP2 Q-former with models pretrained on Japanese datasets.
This repository is for profiling, extracting, visualizing, and reusing generative AI weights, with the aim of building more accurate AI models and auditing/scanning weights at rest to identify knowledge domains for risk.
Caption images across your datasets with state of the art models from Hugging Face and Replicate!
Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
Finetuning Large Visual Models on Visual Question Answering
Uses AI to scare people...more.
Caption generator using LAVIS and Argos Translate.
Explores visual question answering using the Gemini LLM, with images supplied as URLs or local files of any extension.
Too lazy to organize my desktop, make gpt + BLIP-2 do it /s
An offline AI-powered video analysis tool with object detection (YOLO), image captioning (BLIP), speech transcription (Whisper), audio event detection (PANNs), and AI-generated summaries (LLMs via Ollama). It ensures privacy and offline use with a user-friendly GUI.
This project performs multimodal document analysis and query retrieval by downloading PDFs, converting pages to images, indexing them for semantic search, and analyzing retrieved images using visual-language models like Qwen2VL and Blip2.
Creating stylish social media captions for an image using multimodal models and reinforcement learning.
Implementation of Q-Former pre-training.
A web-based application that leverages the BLIP-2 model to generate detailed descriptions of uploaded images (see the minimal captioning sketch after this list).
AltTextGenerator is a Python-based tool leveraging the BLIP-2 model to generate meaningful alt text for images. It processes images from a folder, generates accurate alt text, and renames the images, aiding developers, content creators, and accessibility professionals.
An end-to-end, deep-learning-based tool for image caption generation.
Winning solution for image captioning challenge at 2024 InThon Datathon
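Many of the captioning repositories above build on the same core BLIP-2 call in Hugging Face Transformers. As a point of reference, here is a minimal, hedged sketch of unconditional BLIP-2 captioning; the checkpoint name (Salesforce/blip2-opt-2.7b) and the image path are assumptions for illustration, not details taken from any repository in this list.

```python
# Minimal BLIP-2 captioning sketch using Hugging Face Transformers.
# Checkpoint and image path are illustrative assumptions; any BLIP-2 checkpoint works similarly.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Load any RGB image; replace the path with your own file.
image = Image.open("example.jpg").convert("RGB")

# Unconditional captioning: no text prompt, the model describes the image.
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```

Passing a text prompt to the processor alongside the image (e.g., a question) turns the same call into visual question answering, which is the pattern several of the VQA repositories above rely on.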