Wenhai Wang's starred repositories
Awesome-Multimodal-Large-Language-Models
Latest Papers and Datasets on Multimodal Large Language Models, and Their Evaluation.
IP-Adapter
The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
chatgpt-prompts-for-academic-writing
This list of writing prompts covers a range of topics and tasks, including brainstorming research ideas, improving language and style, conducting literature reviews, and developing research plans.
InternLM-XComposer
InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.
UniRepLKNet
[CVPR'24] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
InterFuser
[CoRL 2022] InterFuser: Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer
all-seeing
[ICLR 2024] This is the official implementation of the paper "The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World"
DCI-VTON-Virtual-Try-On
[ACM Multimedia 2023] Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow.
Mini-DALLE3
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models
MultimodalOCR
On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench)
Vision-RWKV
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
ControlLLM
ControlLLM: Augment Language Models with Tools by Searching on Graphs
AVSegFormer
[AAAI 2024] AVSegFormer: Audio-Visual Segmentation with Transformer