A collection of papers on the joint learning of vision with speech, audio, or music (audio-visual learning) at CVPR 2024. For those working in audio-visual learning, computer audition, or speech/audio/music, this repo gives a brief summary of the accepted main-conference papers along with their code and/or datasets. Entries for works accepted at workshops or demos will be added as they become available.
- Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion. Kiran Chhatre, Radek Daněček, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J. Black, Timo Bolkart [code]
- ES³: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations. Yuanhang Zhang, Shuang Yang, Shiguang Shan, Xilin Chen
- Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods and Applications. Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, Oncel Tuzel
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation. Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, Qifeng Chen [code]
- Faces that Speak: Jointly Synthesising Talking Face and Speech from Text. Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung
- Towards Variable and Coordinated Holistic Co-Speech Motion Generation. Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, Changxing Ding [code]
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis. Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, Christian Theobalt [code]
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition. Yusheng Dai, Hang Chen, Jun Du, Xiaofei Ding, Ning Ding, Feijun Jiang, Chin-Hui Lee [code]
- AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation. Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro [code] CVPR Highlight
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model. Xu He, Qiaochu Huang, Zhensong Zhang, Zhiwei Lin, Zhiyong Wu, Sicheng Yang, Minglei Li, Zhiyi Chen, Songcen Xu, Xiaofei Wu [code]
- Learning Adaptive Spatial Coherent Correlations for Speech-Preserving Facial Expression Manipulation. Tianshui Chen, Jianman Lin, Zhijing Yang, Chunmei Qing, Liang Lin [code] CVPR Highlight
- EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling. Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, Michael J. Black [code]
- Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation. Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, Yike Guo
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos. Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
- QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition. Xiang Li, Jinglu Wang, Xiaohao Xu, Xiulian Peng, Rita Singh, Yan Lu, Bhiksha Raj
- UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition. Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan [code]
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action. Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi [code]
- CrossMAE: Cross-Modality Masked Autoencoders for Region-Aware Audio-Visual Pre-Training. Yuxin Guo, Siyang Sun, Shuailei Ma, Kecheng Zheng, Xiaoyi Bao, Shijie Ma, Wei Zou, Yun Zheng
- Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning. Nikhil Singh, Chih-Wei Wu, Iroro Orife, Mahdi Kalayeh
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation. Yuanhong Chen, Yuyuan Liu, Hu Wang, Fengbei Liu, Chong Wang, Helen Frazer, Gustavo Carneiro
- RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation. Zeyuan Yang, Jiageng Liu, Peihao Chen, Anoop Cherian, Tim K. Marks, Jonathan Le Roux, Chuang Gan
- DiVAS: Video and Audio Synchronization with Dynamic Frame Rates. Clara Fernandez-Labrador, Mertcan Akçay, Eitan Abecassis, Joan Massich, Christopher Schroers
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners. Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen [code]
- AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection. Trevine Oorloff, Surya Koppisetti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj
- AV-RIR: Audio-Visual Room Impulse Response Estimation. Anton Ratnarajah, Sreyan Ghosh, Sonal Kumar, Purva Chiniya, Dinesh Manocha [code]
- DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction. Junwen Xiong, Peng Zhang, Tao You, Chuanyue Li, Wei Huang, Yufei Zha [code]
- Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling. Shentong Mo, Pedro Morgado [code]
- Weakly-Supervised Audio-Visual Video Parsing with Prototype-based Pseudo-Labeling. Kranthi Kumar Rachavarapu, Kalyan Ramakrishnan, A. N. Rajagopalan
- The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective. Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, Vamsi Krishna Ithapu, Ruohan Gao
- Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation. Qi Yang, Xing Nie, Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, Shiming Xiang [code] CVPR Highlight
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models. Shivangi Aneja, Justus Thies, Angela Dai, Matthias Nießner [code]
- Audio-Visual Segmentation via Unlabeled Frame Exploitation. Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang
- Cyclic Learning for Binaural Audio Generation and Localization. Zhaojian Li, Bin Zhao, Yuan Yuan
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations. Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard [code]
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio. Chao Xu, Yang Liu, Jiazheng Xing, Weida Wang, Mingze Sun, Jun Dan, Tianxin Huang, Siyuan Li, Zhi-Qi Cheng, Ying Tai, Baigui Sun [code]
- Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos. Chen Liu, Peike Patrick Li, Qingtao Yu, Hongwei Sheng, Dadong Wang, Lincheng Li, Xin Yu
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark. Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard [code] CVPR Highlight
- TIM: A Time Interval Machine for Audio-Visual Action Recognition. Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen [code]
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language. Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman [code]
- T-VSL: Text-Guided Visual Sound Source Localization in Mixtures. Tanvir Mahmud, Yapeng Tian, Diana Marculescu [code]
- Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge. Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim [code]
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos. Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
- SonicVisionLM: Playing Sound with Vision Language Models. Zhifeng Xie, Shengye Yu, Qile He, Mengtian Li [code]
- CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation. Xi Liu, Ying Guo, Cheng Zhen, Tong Li, Yingying Ao, Pengfei Yan
- Hearing Anything Anywhere. Mason Long Wang, Ryosuke Sawata, Samuel Clarke, Ruohan Gao, Shangzhe Wu, Jiajun Wu [code]
- Diff-BGM: A Diffusion Model for Video Background Music Generation. Sizhe Li, Yiming Qin, Minghang Zheng, Xin Jin, Yang Liu [code]
- MuseChat: A Conversational Music Recommendation System for Videos. Zhikang Dong, Xiulong Liu, Bin Chen, Pawel Polak, Peng Zhang [code] CVPR Highlight
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models. Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha [code] CVPR Highlight
- DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance. Zixuan Wang, Jia Jia, Shikun Sun, Haozhe Wu, Rong Han, Zhenyu Li, Di Tang, Jiaqing Zhou, Jiebo Luo [code]
Please feel free to open an issue or email (xl1995@uw.edu) to add or update links, or to correct any wrong information.