Awesome CVPR 2024 Papers, Workshops, Challenges, and Tutorials!

The 2024 Conference on Computer Vision and Pattern Recognition (CVPR) received 11,532 valid paper submissions, and only 2,719 were accepted, for an overall acceptance rate of about 23.6%.

Below is a list of the papers, posters, challenges, workshops, and datasets I'm most excited about.

I'll be there with my crew from Voxel51 at Booth 1519, which will be located right next to the Meta and Amazon Science booths!

If you found the repo useful, come by and say "Hi" and I'll hook you up with some swag!

🏆 Challenges

Title	Authors	Code / arXiv Page	Summary
Agriculture-Vision Prize Challenge			The Agriculture-Vision Prize Challenge 2024 encourages the development of algorithms for recognizing agricultural patterns from aerial images and aims to promote sustainable agriculture practices. Semi-supervised learning techniques will be used to merge two datasets and assess model performance. Prizes are $2,500 for 1st place, $1,500 for 2nd place, and $1,000 for 3rd place.
Building3D Challenge			This challenge utilizes the Building3D dataset, an urban-scale publicly available dataset with over 160,000 buildings from 16 cities in Estonia. Participants must develop algorithms that take point clouds as input and generate wireframe models.
Structured Semantic 3D Reconstruction (S23DR) Challenge			Transform posed images or SfM outputs into wireframes for extracting semantically meaningful measurements. HoHo dataset provides images, point clouds, and wireframes with semantically tagged edges. $25,000 prize pool.
Pixel-level Video Understanding in the Wild			The PVUW challenge includes four tracks: Video Semantic Segmentation (VSS), Video Panoptic Segmentation (VPS), Complex Video Object Segmentation, and Motion Expression guided Video Segmentation. The two new tracks, based on the MOSE and MeViS datasets, aim to foster the development of more comprehensive and robust pixel-level understanding of video scenes in complex environments and realistic scenarios.
SyntaGen Competition			The SyntaGen Competition challenges participants to create high-quality synthetic datasets using Stable Diffusion and the 20 class names from PASCAL VOC 2012 for semantic segmentation. The datasets will be evaluated by training a DeepLabv3 model and assessing its performance on a private test set, with submissions ranked based on the mIoU metric. The top 2 teams will receive cash prizes and the opportunity to present their work at the workshop.
SMART-101 CVPR 2024 Challenge			The SMART-101 challenge, hosted on EvalAI, evaluates the multimodal algorithmic reasoning abilities of AI models. It is built on the Simple Multimodal Algorithmic Reasoning Task (SMART-101) dataset of visuo-linguistic puzzles designed for children in the 6-8 age group, where each puzzle pairs an image with a question and solving it requires skills such as counting, arithmetic, and spatial reasoning.
Snapshot Spectral Imaging Face Anti-spoofing Challenge			New spectroscopy sensors can improve facial recognition systems' ability to identify realistic flexible masks made of silicone or latex. Snapshot Spectral Imaging (SSI) technology obtains compressed sensing spectral images in a single exposure, making it useful for incorporating spectroscopic information. Using a snapshot spectral camera, we created HySpeFAS - the first snapshot spectral face anti-spoofing dataset with 6760 hyperspectral images, each containing 30 spectral channels. This competition aims to encourage research on new spectroscopic sensor face anti-spoofing algorithms suitable for SSI images.
Chalearn Face Anti-spoofing Workshop			Spoofing clues resulting from physical presentation attacks are caused by color distortion, screen moire patterns, and production traces. Forgery clues resulting from digital editing attacks are changes in pixel values. The fifth competition aims to explore common characteristics of these attack clues and promote unified detection algorithms. We have a Unified physical-digital Attack dataset, called UniAttackData, with 1,800 participants, 2 physical and 12 digital attacks, and 29,706 videos.
DataCV Challenge			The DataCV Challenge searches training sets for various targets in object detection. The datasets for the challenge consist of a data source pool, combining multiple existing detection datasets, and a newly introduced target dataset with diverse detection environments recorded across 100 countries. Test set A is publicly available on Github, while test set B is reserved for determining challenge awards. An evaluation server is provided for calculating test accuracy. Ethical considerations have been followed by blurring human faces and vehicle license plates to ensure individual privacy and validating copyright before distributing the datasets.
Grocery Vision			The GroceryVision Dataset is part of the RetailVision Workshop Challenge at CVPR 2024. It has two tracks that use real-world retail data collected in typical grocery store environments. Track 1 focuses on video Temporal Action Localization (TAL) and Spatial Temporal Action Localization (STAL). Participants are provided with 73,683 image-annotation pairs for training, and their performance is evaluated based on frame-mAP for TAL and tube-mAP for STAL. Track 2 is the Multi-modal Product Retrieval (MPR) challenge. Participants must design methods to accurately retrieve product identity by measuring similarity between images and descriptions.
SoccerNet-GSR'24 Challenge			SoccerNet Game State Reconstruction (GSR) is a novel computer vision task involving the tracking and identification of sports players from a single moving camera to construct a video game-like minimap, without any specific hardware worn by the players. A new benchmark for Game State Reconstruction is introduced for this challenge, including a new dataset with 200 annotated soccer clips, a new evaluation metric, and a public baseline to serve as a starting point for the participants. Methods will be ranked according to their performance on the introduced metric on a held-out challenge set.
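
The SyntaGen entry above ranks submissions by mIoU (mean intersection over union). As a rough illustration of what that metric measures, here is a minimal pure-Python sketch. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example; the competition's exact evaluation code may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over flat lists of per-pixel class ids.

    Classes that appear in neither `pred` nor `target` are skipped
    (one common convention, assumed here for illustration).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 4-pixel example with two classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(f"mIoU = {score:.3f}")
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.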

👁️ Vision Transformers

Title	Authors	Code / arXiv Page	Summary
Point Transformer V3: Simpler, Faster, Stronger	Xiaoyang Wu, Li Jiang, Peng-Shuai Wang		The Point Transformer V3 (PTv3) is a 3D point cloud transformer architecture that prioritizes simplicity and efficiency to enable scalability, overcoming the traditional trade-off between accuracy and speed in point cloud processing. It uses point cloud serialization, serialized attention, enhanced conditional positional encoding (xCPE), and simplified designs to improve efficiency. PTv3 achieves state-of-the-art performance across over 20 downstream tasks spanning indoor and outdoor scenarios, while offering superior speed and memory efficiency compared to previous point transformers.
RepViT: Revisiting Mobile CNN From ViT Perspective	Ao Wang, Hui Chen, Zijia Lin		RepViT is a new lightweight convolutional neural network series for mobile devices. It combines efficient architectural choices from Vision Transformers with a standard CNN, MobileNetV3-L. Key steps include separating token mixer and channel mixer, reducing expansion ratio, using early convolutions as stem, employing a deeper downsampling layer, replacing the classifier with a simpler one, and using only 3x3 convolutions. RepViT outperforms existing CNNs and ViTs on vision tasks while maintaining favorable latency on mobile devices.
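
The PTv3 entry above mentions point cloud serialization, which turns an unordered set of 3D points into a 1D sequence so that attention can run over contiguous chunks. The sketch below shows one way to do this with a Z-order (Morton) space-filling curve. Treat it as an assumption-laden illustration rather than the paper's implementation: the helper names `morton3d` and `serialize`, the 10-bit grid, and the requirement of non-negative coordinates are all choices made for this example.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of three non-negative ints into one key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def serialize(points, grid_size):
    """Return point indices sorted along a Z-order curve.

    `points` is a list of (x, y, z) tuples with non-negative coordinates;
    each point is first quantized to a voxel of side `grid_size`.
    """
    keys = [
        morton3d(int(px // grid_size), int(py // grid_size), int(pz // grid_size))
        for px, py, pz in points
    ]
    return sorted(range(len(points)), key=lambda i: keys[i])


points = [(2.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]
order = serialize(points, grid_size=1.0)
print(order)
```

Points that are close in space tend to receive nearby keys, so slicing the sorted sequence into fixed-size windows gives spatially local groups without a neighbor search.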

👁️💬 Vision-Language

Title	Authors	Summary
Vlogger: Make Your Dream A Vlog	Shaobin Zhuang, Kunchang Li, Xinyuan Chen	Vlogger is an AI system that generates minute-level video blogs from user descriptions. It uses a Large Language Model (LLM) to break down the task into four stages: Script, Actor, ShowMaker, and Voicer. The ShowMaker uses a Spatial-Temporal Enhanced Block (STEB) to enhance spatial-temporal coherence. Vlogger can generate 5+ minute vlogs surpassing previous long video generation methods.
A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models	Julio Silva-Rodríguez, Sina Hajimiri, Ismail Ben Ayed	CLIP is a powerful vision-language model for visual recognition. However, fine-tuning it for small downstream tasks with limited labeled samples is challenging. Efficient transfer learning (ETL) methods adapt VLMs with few parameters, but require careful per-task hyperparameter tuning using large validation sets. To overcome this, the authors propose CLAP, a principled approach that adapts linear probing for few-shot learning. CLAP consistently outperforms ETL methods, providing an efficient and robust approach for few-shot adaptation of large vision-language models in realistic settings where hyperparameter tuning with large validation sets is not feasible.
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want	Zeyi Sun, Ye Fang, Tong Wu	Alpha-CLIP is an improved version of the CLIP model that focuses on specific regions of interest in images through an auxiliary alpha channel. It can enhance CLIP in different image-related tasks, including 2D and 3D image generation, captioning, and detection. Alpha-CLIP preserves CLIP's visual recognition ability and boosts zero-shot classification accuracy by 4.1% when using foreground masks.
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update	Zhi Gao, Yuntao Du, Xintong Zhang	CLOVA is a system that leverages large language models (LLMs) to generate programs that can accomplish various visual tasks using off-the-shelf visual tools. To overcome the limitation of fixed tools, CLOVA has a closed-loop framework that includes an inference phase, reflection phase, and learning phase. It also uses a multimodal global-local reflection scheme and three flexible methods to collect real-time training data. CLOVA's learning capability enables it to adapt to new environments, resulting in 5-20% better performance on VQA, multiple-image reasoning, knowledge tagging, and image editing tasks.
Convolutional Prompting meets Language Models for Continual Learning	Anurag Roy, Riddhiman Moulick, Vinay K. Verma	The paper introduces ConvPrompt, a novel approach for continual learning in vision transformers. ConvPrompt leverages convolutional prompts and large language models to maintain layer-wise shared embeddings and improve knowledge sharing across tasks. The method improves state-of-the-art by around 3% with significantly fewer parameters. In summary, ConvPrompt is an efficient and effective prompt-based continual learning approach that adapts the model capacity based on task similarity.
Improved Visual Grounding through Self-Consistent Explanations	Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang	This paper presents a strategy called SelfEQ. The aim of SelfEQ is to improve the ability of vision-and-language models to locate specific objects in an image. The proposed strategy involves adding paraphrases generated by a large language model to existing text-image datasets. The model is then fine-tuned to ensure that a phrase and its paraphrase map to the same region in the image. This promotes self-consistency in visual explanations, expands the model's vocabulary, and enhances the quality of object locations highlighted by gradient-based visual explanation methods like GradCAM.
Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation	Ba Hung Ngo, Nhat-Tuong Do-Tran, Tuan-Ngoc Nguyen	The paper introduces a new approach called Explicitly Class-specific Boundaries (ECB) for domain adaptation, which combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) by training CNN on ViT. ECB uses ViT to determine class-specific decision boundaries and CNN to group target features based on those boundaries. This improves the quality of pseudo labels and reduces knowledge disparities. The paper also provides visualizations to demonstrate the effectiveness of the proposed ECB method.
Link-Context Learning for Multimodal LLMs	Yan Tai, Weichen Fan, Zhao Zhang	The paper presents Link-Context Learning (LCL), a new approach that enables Multimodal Large Language Models (MLLMs) to learn new concepts from limited examples in a single conversation. The proposed training strategy fine-tunes MLLMs using contrast learning and balanced sampling from LCL and original tasks. The ISEKAI dataset is introduced to evaluate MLLMs' performance on LCL tasks. Experiments show that LCL-MLLM outperforms vanilla MLLMs on the ISEKAI dataset. The paper presents LCL as a promising paradigm for expanding MLLMs' abilities and paving the way for more human-like learning in multimodal models.
Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations	Sangmin Lee, Bolin Lai, Fiona Ryan	The paper "Modeling Multimodal Social Interactions" introduces three new tasks to model multi-party social interactions. The authors propose a novel multimodal baseline that leverages densely aligned language-visual representations to address these challenges. Experiments demonstrate the effectiveness of the proposed approach in modeling social interactions.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration	Qinghao Ye, Haiyang Xu, Jiabo Ye	mPLUG-Owl2 is a multi-modal language model that improves text and multi-modal task performance. It uses a modularized network design with a language decoder as a universal interface for managing different modalities. It incorporates shared functional modules and a modality-adaptive module. It uses a two-stage training paradigm consisting of vision-language pre-training and joint vision-language instruction tuning. Experiments show it achieves SOTA results on multiple vision-language and pure-text benchmarks. It introduces novel architecture designs and training methods to enable modality collaboration, leading to strong performance in text-only and multi-modal tasks.
OneLLM: One Framework to Align All Modalities with Language	Jiaming Han, Kaixiong Gong, Yiyuan Zhang	OneLLM aligns 8 modalities to language using a unified framework. It uses a frozen CLIP-ViT and a universal projection module (UPM) that mixes image projection experts. OneLLM progressively aligns modalities to the LLM, starting with image-text alignment and expanding to video, audio, point cloud, depth/normal map, IMU, and fMRI data. The authors curated a large multimodal instruction dataset to fine-tune OneLLM's multimodal understanding and reasoning capabilities. OneLLM performs excellently on 25 diverse multimodal benchmarks, including captioning, question answering, and reasoning tasks. OneLLM pioneers a unified and scalable MLLM framework that can align a wide range of modalities with language and achieve strong multimodal understanding through instruction finetuning.
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation	Qidong Huang, Xiaoyi Dong, Pan Zhang	OPERA is a novel solution to alleviate hallucination in MLLMs. It introduces a penalty term and rollback strategy during beam-search decoding, targeting the root cause of self-attention patterns. It doesn't require additional data or training and has been proven effective in experiments.
Describing Differences in Image Sets with Natural Language	Lisa Dunlap, Yuhui Zhang, Xiaohan Wang
Let’s Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation	Shanshan Zhong, Zhongzhan Huang, Shanghua Gao
Osprey: Pixel Understanding with Visual Instruction Tuning		Osprey is an approach that improves multimodal large language models' accuracy in understanding visual information. It uses fine-grained mask regions and a convolutional CLIP backbone to extract precise visual mask features from high-resolution inputs efficiently. The authors curated the Osprey-724K dataset with 724K samples to facilitate mask-based instruction tuning. Osprey outperforms previous state-of-the-art methods in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning.
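
The SelfEQ entry above describes fine-tuning so that a phrase and its paraphrase map to the same image region. One simple way to picture such a self-consistency objective is a penalty on the difference between the two attention maps. The sketch below uses a mean squared difference purely as an illustration; the function name `consistency_loss` and the choice of penalty are assumptions, not the loss defined in the paper.

```python
def consistency_loss(map_a, map_b):
    """Mean squared difference between two same-sized 2D attention maps.

    `map_a` could be the visual explanation for a phrase and `map_b` the one
    for its paraphrase; a value of 0.0 means they highlight identical regions.
    """
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    if len(flat_a) != len(flat_b) or not flat_a:
        raise ValueError("maps must be non-empty and have the same size")
    return sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)


# Hypothetical 2x2 attention maps for a phrase and its paraphrase.
phrase_map = [[1.0, 0.0], [0.0, 0.0]]
paraphrase_map = [[0.0, 0.0], [0.0, 1.0]]
print(consistency_loss(phrase_map, phrase_map))
print(consistency_loss(phrase_map, paraphrase_map))
```

Minimizing a term like this during fine-tuning would push the two maps toward each other, which is the intuition behind encouraging consistent grounding across paraphrases.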

👩🏾‍🏫 Tutorial

Title	Authors	Code / arXiv Page	Summary
Diffusion-based Video Generative Models			The tutorial will provide an in-depth exploration of diffusion-based video generative models, a cutting-edge field that is transforming video creation. It aims to help students, researchers, practitioners, video creators and enthusiasts gain the necessary knowledge to enter and contribute to this domain. The tutorial will cover three main topics: (1) Fundamentals: Diffusion models, video foundation models, pre-training (2) Applications: Fine-tuning, editing, controls, personalization, motion customization (3) Evaluation & Safety: Benchmarks, metrics, attacks, watermarks, copyright protection
From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond			The MLLM Tutorial at CVPR 2024 reviews cutting-edge research in multimodal large language models (MLLMs). These models integrate various modalities to enable AI systems to understand, reason, and plan. The tutorial focuses on MLLM architecture design, instructional learning, and multimodal reasoning. The organizers have compiled an extensive reading list for LLMs, MLLMs, instruction tuning, and reasoning. The tutorial aims to summarize technical advancements, challenges, and future research directions in the evolving field of MLLMs.
Generalist Agent AI			The tutorial on Generalist Agent AI (GAA) provides a comprehensive overview of GAA systems that generate effective actions through multimodal sensory input. Led by experts from academia and industry, the tutorial covers areas such as embodied-multimodality, robotics, gaming, and healthcare. The schedule includes lectures, Q&A sessions, and panel discussions on knowledge agents, agent robotics, and agent foundation models.
Robustness at Inference: Towards Explainability, Uncertainty, and Intervenability			The tutorial focuses on creating neural networks that are robust, explainable, and have uncertainty quantification. It also covers intervenability by humans. Three key concepts covered are explainability, uncertainty, and intervenability. The tutorial covers various applications, such as image recognition, detecting anomalies, and image quality assessment.
Unlearning in Computer Vision: Foundations and Applications			Machine Unlearning (MU) is a new field in computer vision that focuses on removing specific data points, classes, or concepts from pre-trained models. The tutorial aims to provide a comprehensive understanding of MU techniques, algorithmic foundations and applications in computer vision. It also emphasizes the importance of MU from an industry perspective and discusses metrics to verify the unlearning process.
Recent Advances in Vision Foundation Models			The tutorial covers general-purpose vision systems, or vision foundation models, for various downstream tasks at different levels of granularity. It explores the synergy of tasks and the versatility of transformers for building models for multimodal understanding and generation.

📊 Datasets/Benchmarks

Title	Authors	Summary
Benchmarking and Evaluating Large Video Generation Models	Yaofang Liu, Xiaodong Cun, Xuebo Liu	The paper proposes a comprehensive evaluation framework for large video generation models that have grown rapidly. Existing academic metrics are inadequate for evaluating these models trained on massive datasets. The proposed evaluation pipeline comprises prompt curation, objective evaluation, subjective studies, and opinion alignment. The models are evaluated based on 17 objective metrics covering visual quality, content quality, motion quality, and text-caption alignment. Additionally, it provides a comparison table of various video generation models across different metrics and capabilities.
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object	Chenshuang Zhang, Fei Pan, Junmo Kim	ImageNet-D is a new benchmark for evaluating neural network robustness in visual perception tasks. It generates synthetic images with diverse backgrounds, textures, and materials, making it more challenging than other synthetic datasets. Key features include diversified image generation, high visual fidelity, and significant accuracy reduction of various vision models. The benchmark is created by combining object categories and refining through human verification. ImageNet-D is effective in evaluating neural network robustness, as accuracy on it improves with accuracy on ImageNet.
LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs	Yunsheng Ma, Can Cui, Xu Cao	The LaMPilot dataset consists of 4,900 human-annotated traffic scenes, each with an instruction (I), an initial state (b), and a set of goal state criteria (G). The dataset is classified by maneuver and scenario types and is divided into training, validation, and testing sets.
MAPLM: A Real-World Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding	Xu Cao, Tong Zhou, Yunsheng Ma	The dataset contains 3D point cloud Bird's Eye View and high-resolution panoramic images of various traffic scenarios. It also includes detailed annotations at the feature, lane, and road levels. The dataset is designed for a Q&A task, where models will be evaluated based on their ability to answer questions about the traffic scenes such as the number of lanes, presence of intersections, and data quality.
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning	Yuiga Wada, Kanta Kaneda, Daichi Saito, Komei Sugiura	The Polaris dataset, used to train the model, contains 131,020 human judgments from 550 evaluators on the appropriateness of image captions. The dataset is much larger than existing ones and is capable of training image captioning metrics. The captions in Polaris are more diverse, collected from humans and generated by 10 modern image captioning models. This demonstrates the effectiveness and robustness of Polos compared to previous metrics.
VBench: Comprehensive Benchmark Suite for Video Generative Models	Ziqi Huang, Yinan He, Jiashuo Yu	VBench is a tool that evaluates video generation models across 16 quality dimensions. These dimensions fall under Video Quality and Video-Condition Consistency. VBench provides valuable insights by evaluating models across multiple dimensions, content categories, and comparing video vs image generation. The tool's authors plan to expand VBench to more models and video generation tasks. Check out the leaderboard on HF here: https://huggingface.co/spaces/Vchitect/VBench_Leaderboard
SoccerNet Game State Reconstruction	Vladimir Somers, Victor Joos, Anthony Cioppa, Silvio Giancola, Seyed Abolfazl Ghasemzadeh, Floriane Magera, Baptiste Standaert, Amir Mohammad Mansourian, Xin Zhou, Shohreh Kasaei, Bernard Ghanem, Alexandre Alahi, Marc Van Droogenbroeck, Christophe De Vleeschouwer	SoccerNet Game State Reconstruction (GSR) is a novel computer vision task involving the tracking and identification of sports players from a single moving camera to construct a video game-like minimap, without any specific hardware worn by the players. SoccerNet-GSR, the released dataset, includes 200 clips with 9.37M pitch localization annotations and 2.36M athlete positions on the pitch with their role, team & jersey number. Furthermore, a new performance metric 'GS-HOTA' is introduced to evaluate GSR methods.

📦 3D Vision

Title	Authors	Summary
ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering	Haokai Pang, Heming Zhu	ASH generates real-time photorealistic renderings of animatable human avatars using Gaussian splats attached to a deformable mesh template. The skeletal motion is encoded using pose-dependent normal maps, and the dynamic Gaussian parameters are learned using 2D convolutional architectures. This approach surpasses existing real-time human avatar rendering methods and represents a significant step towards producing real-time, high-fidelity, controllable human avatars.
Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes	Hmrishav Bandyopadhyay, Subhadeep Koley, Ayan Das	This paper introduces a new method for generating precise 3D shapes from abstract freehand sketches, without the need for paired sketch-3D data. The approach uses a part-level modeling and alignment framework, which enables sketch modeling and in-position editing. By operating in a low-dimensional implicit latent space and using diffusion models, the approach significantly reduces computational demands and processing time. Overall, the method offers a novel solution for enabling accurate 3D generation from abstract sketches.
GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians	Shenhan Qian, Tobias Kirschstein, Liam Schoneveld	GaussianAvatars is a new technique for creating customizable photorealistic head avatars using a dynamic 3D representation based on 3D Gaussian splats. This approach allows for precise animation control while maintaining photorealistic rendering. The technique has shown impressive animation capabilities in challenging scenarios, such as reenactments from a driving video, where it outperforms existing techniques by a significant margin.
Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships	Sebastian Koch, Narunas Vaskevicius	Open3DSG predicts open-vocabulary 3D scene graphs from point clouds, combining vision-language and large language models. Key ideas include constructing a 3D graph with a GNN, aligning features with CLIP, and using an LLM. It allows querying arbitrary objects and relationships at inference time, and enables open-vocabulary prediction not limited to fixed labels.

🧨 Diffusion

Title	Authors	Code / arXiv Page	Summary
Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features	Niladri Shekhar Dutt, Sanjeev Muralikrishnan, Niloy J. Mitra		Diff3F is a feature descriptor for untextured 3D shapes. It computes 3D semantic features using pre-trained 2D diffusion models, rendering depth and normal maps from multiple views, and lifting the 2D diffusion features back to the 3D surface. This produces semantic descriptors on the 3D shape without requiring additional training data or part segmentation.
One-step Diffusion with Distribution Matching Distillation	Tianwei Yin		Distribution Matching Distillation (DMD) accelerates multi-step diffusion models into a one-step generator without compromising image quality. DMD matches the distribution of the original diffusion model by minimizing KL divergence and using two score functions - one for the actual data distribution and one for the generated distribution. A regression loss matches the large-scale structure of the multi-step diffusion outputs.
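
The DMD entry above says the generator is trained by minimizing a KL divergence using two score functions. The toy sketch below illustrates that idea in one dimension: the "real" distribution is N(0, 1), the generator is `x = z + theta` so its output distribution is N(theta, 1), and the KL gradient with respect to `theta` is the average of (fake score minus real score). Everything here is an assumption for illustration: the scores are analytic Gaussians instead of learned diffusion models, the regression loss is omitted, and the names `gaussian_score` and `distill_shift` are hypothetical.

```python
def gaussian_score(x, mean, std=1.0):
    """Score (gradient of the log-density) of N(mean, std^2) at x."""
    return -(x - mean) / (std ** 2)


def distill_shift(theta=3.0, lr=0.1, steps=100,
                  noise=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Gradient descent on KL(fake || real) for a one-parameter generator."""
    for _ in range(steps):
        grad = 0.0
        for z in noise:
            x = z + theta                        # generator output
            s_real = gaussian_score(x, 0.0)      # score of the target N(0, 1)
            s_fake = gaussian_score(x, theta)    # score of the generator's N(theta, 1)
            grad += (s_fake - s_real) * 1.0      # dx/dtheta = 1
        theta -= lr * grad / len(noise)
    return theta


final_theta = distill_shift()
print(f"theta after training: {final_theta:.6f}")
```

Because the score difference equals `theta` for every sample in this toy setup, each step shrinks `theta` by a factor of `1 - lr`, and the generated distribution converges to the real one.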

🧨 Diffusion

Title	Authors	Summary
DemoFusion: Democratising High-Resolution Image Generation With No $$$	Ruoyi Du, Dongliang Chang, Timothy M. Hospedales, Yi-Zhe Song, Zhanyu Ma	DemoFusion is an extension that enables the generation of high-res images through an accessible and efficient inference procedure. It uses global-local denoising paths and introduces three techniques for coherent high-res generation: progressive upscaling, skip residual, and dilated sampling. DemoFusion unlocks the potential in existing open-source text-to-image models without additional training or prohibitive costs, democratizing high-res image synthesis.
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing	Yujun Shi, Chuhui Xue, Jiachun Pan	DragDiffusion is a novel method for interactive point-based image editing that enhances the applicability and versatility of the DragGAN framework by extending it to diffusion models. It optimizes the latent of a single diffusion step and introduces techniques to preserve the identity of the original image. The authors present DragBench, the first benchmark dataset for evaluating interactive point-based image editing methods. Experiments demonstrate the effectiveness of DragDiffusion compared to DragGAN, and an ablation study explores key factors.
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models	Shivangi Aneja, Justus Thies, Angela Dai	FaceTalk is a novel method to generate 3D motion sequences of talking human heads from audio signals. It employs neural parametric head models with speech signals and a new latent diffusion model. The approach denoises Gaussian noise sequences iteratively and extracts mesh sequences using marching cubes from the frozen NPHM model. FaceTalk outperforms existing methods by 75% in perceptual user study evaluations and produces visually natural motion with diverse facial expressions and styles.
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models	Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe	RAVE is a fast and innovative method for zero-shot video editing that uses pre-trained text-to-image diffusion models. It preserves the original motion and structure of the input video while producing high-quality, temporally consistent edited videos. RAVE edits videos 25% faster than existing methods by efficiently leveraging spatio-temporal interactions between frames. It outperforms existing methods across diverse editing scenarios and requires no extra training or manual inputs. However, there are some limitations such as flickering issues for extreme shape edits in very long videos and fine detail flickering. Try the demo here: https://huggingface.co/spaces/ozgurkara/RAVE
Relightful Harmonization: Lighting-aware Portrait Background Replacement	Mengwei Ren, Wei Xiong, Jae Shin Yoon	The paper presents Relightful Harmonization, a technique for harmonizing portrait lighting with a new background image. The method encodes lighting information from the target background image and aligns it with features from panoramic environment maps. Relightful Harmonization outperforms existing benchmarks in visual fidelity and lighting coherence. The technique only requires an arbitrary background image during inference and expands the training data using a novel data simulation pipeline. This approach enables realistic, lighting-aware portrait background replacement using just a single target background image, without requiring HDR environment maps.
SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors	Dave Zhenyu Chen, Haoxuan Li, Hsin-Ying Lee	SceneTex generates high-quality indoor scene textures using depth-to-image diffusion priors. Key features include optimization in RGB space, multiresolution texture field, and cross-attention decoder for global style consistency. Experiments show it outperforms prior methods, but limitations include occasional artifacts and inability to handle complex geometry.
ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models	Lukas Höllein, Aljaž Božič, Norman Müller	ViewDiff generates 3D-consistent images from different viewpoints of the same object or scene. The approach involves enhancing the U-Net architecture of pretrained text-to-image models with new layers, training on real-world multi-view datasets using a denoising process, and producing multi-view consistent images of the same object in a single forward denoising pass. The results show high visual quality with improved 3D consistency compared to existing methods.

🧩 Segmentation

Title	Authors	Code / arXiv Page	Summary
Amodal Ground Truth and Completion in the Wild	Guanqi Zhan, Chuanxia Zheng, Weidi Xie		The paper introduces amodal image segmentation which predicts masks for entire objects, including occluded parts. Previous methods used manual annotation, but the authors use 3D data to construct the MP3D-Amodal dataset with authentic amodal ground truth masks. Two architecture variants are explored: a two-stage OccAmodal model and a one-stage SDAmodal model. Their method achieves state-of-the-art performance on amodal segmentation datasets, including COCOA and the new MP3D-Amodal dataset.

🛠️ Workshop

Title	Authors	Code / arXiv Page	Summary
Dataset Distillation			Full-day workshop on June 17. The workshop will explore the potential of Dataset Distillation (DD) in computer vision applications like face recognition, object detection, image segmentation, and video understanding. DD has the potential to reduce training costs, make AI eco-friendly, and enable research groups with limited resources to engage in state-of-the-art research. The workshop will also cover related topics such as active learning, few-shot learning, generative models, and learning from synthetic data.
Urban Scene Modeling: Where Vision Meets Photogrammetry and Graphics			Full-day workshop on June 17th. The USM3D workshop unites researchers in computer vision, graphics, and photogrammetry to collaborate on urban scene modeling challenges. The event includes talks, presentations, a challenge, and a poster session.
What is Next in Multimodal Foundation Models?			Takes place the morning of June 18th. The workshop focuses on the emerging field of multimodal foundation models, which are trained on multiple modalities simultaneously and applied in text-to-image/video/3D generation, zero-shot classification, and cross-modal retrieval. It brings together leaders to discuss different aspects of these models, including their design, efficiency, ethics, and open availability.
Gaze Estimation and Prediction in the Wild			Morning of June 18th. The workshop will cover gaze-based interaction techniques, eye tracking technologies, applications of gaze interaction in various domains, and methodological considerations in gaze-based research. The main objective is to enhance the field of gaze interaction by providing a platform for researchers and practitioners to present their work, exchange ideas, and explore future directions.
Large Scale Holistic Video Understanding			The main objective of the workshop is to establish a video benchmark integrating joint recognition of all the semantic concepts, as a single class label per task is often not sufficient to describe the holistic content of a video. The planned panel discussion with the world’s leading experts on this problem will be a fruitful source of input and ideas for all participants. The community is invited to help extend the HVU dataset, which will spur research in video understanding as a comprehensive, multi-faceted problem.
Representation Learning with Very Limited Images			Afternoon of June 18th. This workshop focuses on developing visual and multi-modal models with limited data resources. It aims to bring together diverse communities that work on approaches such as self-supervised learning with a single image or synthetic pre-training with generated images. The workshop's organizers include researchers from various institutions.
Responsible Data			Full-day workshop on June 18. The Workshop on Responsible Data will discuss building responsible and inclusive datasets for computer vision. Topics include context-driven dataset development, best practices for data collectors and annotators, responsible datasets for AI models, measuring dataset responsibility, transparency, data privacy and accountability, and engaging the open-source community.
Synthetic Data for Computer Vision			The workshop focuses on the use of synthetic data for training and evaluating computer vision models. It covers topics such as the effectiveness, efficiency, scalability, benchmarking, evaluation, risks, and ethical considerations of synthetic data in computer vision and related fields.
Computer Vision for Mixed Reality			VR has the potential to revolutionize how we interact with our environment. Passthrough techniques in headsets like Apple Vision Pro and Quest-3 allow for deeply immersive mixed reality experiences. The workshop focuses on capturing real environments with cameras and using AI to augment them with virtual objects. Its call for papers invites research on novel methods for Mixed Reality. Topics include real-time view synthesis, scene understanding, 3D capture, and more.
Multimodal Algorithmic Reasoning			Morning of June 17. The Multimodal Algorithmic Reasoning (MAR) 2024 workshop at CVPR 2024 aims to bring together researchers working on neural algorithmic learning, multimodal reasoning, and cognitive models of intelligence. The workshop will focus on the emerging topic of multimodal algorithmic reasoning, where agents automatically deduce new algorithms for solving real-world tasks, and will also encourage the vision community to build neural networks with human-like intelligence abilities.
Vision Datasets Understanding			The 3rd CVPR Workshop on Vision Datasets Understanding (VDU) is a gathering of experts in the field of analyzing vision datasets. The workshop will cover a variety of topics, including attributes and properties of vision datasets, dataset-level analysis, representations of and similarities between vision datasets, improving vision dataset quality, and evaluating model accuracy under various test environments.
