This repo summarizes tutorials, datasets, papers, code, and tools for the speech separation and speaker extraction tasks. Pull requests are welcome.
- [Speech Separation, Hung-yi Lee, 2020] [Video (Subtitle)] [Video] [Slide]
- [Advances in End-to-End Neural Source Separation, Yi Luo, 2020] [Video (BiliBili)] [Video] [Slide]
- [Audio Source Separation and Speech Enhancement, Emmanuel Vincent, 2018] [Book]
- [Audio Source Separation, Shoji Makino, 2018] [Book]
- [Overview Papers] [Paper (Daniel Michelsanti)] [Paper (DeLiang Wang)] [Paper (Bo Xu)] [Paper (Zafar Rafii)] [Paper (Sharon Gannot)]
- [Overview Slides] [Slide (DeLiang Wang)] [Slide (Haizhou Li)] [Slide (Meng Ge)]
- [Handbook] [Ongoing]
- [Dataset Introduction] [Pure Speech Dataset Slide (Meng Ge)] [Audio-Visual Dataset Slide (Zexu Pan)]
- [WSJ0] [Dataset]
- [WSJ0-2mix] [Script] (a minimal mixing sketch follows the dataset list below)
- [WSJ0-2mix-extr] [Script]
- [WHAM & WHAMR] [Paper (WHAM)] [Paper (WHAMR)] [Dataset]
- [SparseLibriMix] [Script]
- [VCTK-2Mix] [Script]
- [CHiME-5 & CHiME-6 Challenge] [Dataset]
- [AudioSet] [Dataset]
- [Microsoft DNS Challenge] [Dataset]
- [AVSpeech] [Dataset]
- [LRW] [Dataset]
- [LRS2] [Dataset]
- [VoxCeleb] [Dataset]
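The two-speaker mixture sets above (WSJ0-2mix, VCTK-2Mix, and their variants) are all generated the same way: pairs of single-speaker utterances are summed at a random SNR. Use the official scripts for numbers comparable with the literature; the snippet below is only a minimal sketch of the recipe, assuming the `soundfile` package, a hypothetical `mix_pair` helper, and the 0-5 dB SNR range used by WSJ0-2mix.

```python
import numpy as np
import soundfile as sf  # assumed available; any WAV reader works


def mix_pair(path_s1, path_s2, snr_db, sr=8000):
    """Mix two single-speaker utterances at a given SNR (simplified sketch).

    The official WSJ0-2mix scripts additionally normalize by active speech
    level (activlev) and handle 'min'/'max' length modes; this sketch just
    truncates to the shorter utterance ('min' mode).
    """
    s1, sr1 = sf.read(path_s1)
    s2, sr2 = sf.read(path_s2)
    assert sr1 == sr2 == sr
    n = min(len(s1), len(s2))  # keep only the fully-overlapped part
    s1, s2 = s1[:n], s2[:n]
    # Scale s2 so that 10*log10(P1/P2) equals the target SNR.
    p1, p2 = np.mean(s1 ** 2), np.mean(s2 ** 2)
    s2 = s2 * np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    mix = s1 + s2
    # Rescale everything together if the mixture would clip as 16-bit PCM.
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        s1, s2, mix = s1 / peak, s2 / peak, mix / peak
    return mix, s1, s2


# e.g. mix, s1, s2 = mix_pair("spk1.wav", "spk2.wav", snr_db=np.random.uniform(0, 5))
```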
- [Attentional Selection in a Cocktail Party Environment Can Be Decoded from Single-Trial EEG, James O'Sullivan, Cerebral Cortex 2015] [Paper]
- [Selective cortical representation of attended speaker in multi-talker speech perception, Nima Mesgarani, Nature 2012] [Paper]
- [Neural decoding of attentional selection in multi-speaker environments without access to clean sources, James O'Sullivan, Journal of Neural Engineering 2017] [Paper]
- [Speech synthesis from neural decoding of spoken sentences, Gopala K. Anumanchipalli, Nature 2019] [Paper]
- [Towards reconstructing intelligible speech from the human auditory cortex, Hassan Akbari, Scientific Reports 2019] [Paper] [Code]
- [Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation, Po-Sen Huang, TASLP 2015] [Paper] [Code (posenhuang)]
- [Complex Ratio Masking for Monaural Speech Separation, DS Williamson, TASLP 2016] [Paper]
- [Deep clustering: Discriminative embeddings for segmentation and separation, JR Hershey, ICASSP 2016] [Paper] [Code (Kai Li)] [Code (Jian Wu)] [Code (asteroid)]
- [Single-channel multi-speaker separation using deep clustering, Y Isik, Interspeech 2016] [Paper] [Code (Kai Li)] [Code (Jian Wu)]
- [Permutation invariant training of deep models for speaker-independent multi-talker speech separation, Dong Yu, ICASSP 2017] [Paper] [Code (Kai Li)] [Code (Sining Sun)]
- [Recognizing Multi-talker Speech with Permutation Invariant Training, Dong Yu, Interspeech 2017] [Paper]
- [Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks, M Kolbæk, TASLP 2017] [Paper] [Code (Kai Li)] (a PIT loss sketch follows this paper list)
- [Deep attractor network for single-microphone speaker separation, Zhuo Chen, ICASSP 2017] [Paper] [Code (Kai Li)]
- [Alternative Objective Functions for Deep Clustering, Zhong-Qiu Wang, ICASSP 2018] [Paper]
- [Listen, Think and Listen Again: Capturing Top-down Auditory Attention for Speaker-independent Speech Separation, Jing Shi, IJCAI 2018] [Paper]
- [End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction, Zhong-Qiu Wang et al., 2018] [Paper]
- [Modeling Attention and Memory for Auditory Selection in a Cocktail Party Environment, Jiaming Xu, AAAI 2018] [Paper] [Code]
- [Speaker-independent Speech Separation with Deep Attractor Network, Yi Luo, TASLP 2018] [Paper] [Code (Kai Li)]
- [Listening to Each Speaker One by One with Recurrent Selective Hearing Networks, Keisuke Kinoshita, ICASSP 2018] [Paper]
- [TasNet: time-domain audio separation network for real-time, single-channel speech separation, Yi Luo, ICASSP 2018] [Paper] [Code (Kai Li)] [Code (asteroid)]
- [Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation, Yi Luo, TASLP 2019] [Paper] [Code (Kai Li)] [Code (asteroid)]
- [Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation, Yuzhou Liu, TASLP 2019] [Paper] [Code] [Code]
- [Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering, Gene-Ping Yang, Interspeech 2019] [Paper] [Code]
- [Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation, Yi Luo, Arxiv 2019] [Paper] [Code (Kai Li)]
- [A comprehensive study of speech separation: spectrogram vs waveform separation, Fahimeh Bahmaninezhad, Interspeech 2019] [Paper]
- [Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features, Cunhang Fan, Interspeech 2019] [Paper]
- [Interrupted and cascaded permutation invariant training for speech separation, Gene-Ping Yang, ICASSP 2020] [Paper]
- [FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks, Liwen Zhang, MMM 2020] [Paper]
- [Filterbank design for end-to-end speech separation, Manuel Pariente et al., ICASSP 2020] [Paper]
- [Voice Separation with an Unknown Number of Multiple Speakers, Eliya Nachmani, Arxiv 2020] [Paper] [Demo]
- [An Empirical Study of Conv-TasNet, Berkan Kadıoğlu, Arxiv 2020] [Paper] [Code]
- [Wavesplit: End-to-End Speech Separation by Speaker Clustering, Neil Zeghidour et al., Arxiv 2020] [Paper]
- [La Furca: Iterative Context-Aware End-to-End Monaural Speech Separation Based on Dual-Path Deep Parallel Inter-Intra Bi-LSTM with Attention, Ziqiang Shi, Arxiv 2020] [Paper]
- [Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method, Cunhang Fan, Arxiv 2020] [Paper]
- [Identify Speakers in Cocktail Parties with End-to-End Attention, Junzhe Zhu, Arxiv 2020] [Paper] [Code]
- [Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals, Jing Shi, Arxiv 2020] [Paper] [Code/Demo]
- [Speaker-Conditional Chain Model for Speech Separation and Extraction, Jing Shi, Arxiv 2020] [Paper] [Code/Demo]
- [Improving Voice Separation by Incorporating End-to-end Speech Recognition, Naoya Takahashi, ICASSP 2020] [Paper] [Code]
- [A Multi-Phase Gammatone Filterbank for Speech Separation via TasNet, David Ditter, ICASSP 2020] [Paper] [Code]
- [Two-Step Sound Source Separation: Training on Learned Latent Targets, Efthymios Tzinis, ICASSP 2020] [Paper] [Code (Asteroid)] [Code (Tzinis)]
- [Unsupervised Sound Separation Using Mixtures of Mixtures, Scott Wisdom, Arxiv 2020] [Paper]
- [Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss, Ziqiang Shi, 2020] [Paper]
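A recurring ingredient in the papers above, from Yu et al. (ICASSP 2017) and Kolbæk et al. (TASLP 2017) onward, is permutation invariant training (PIT): because the ordering of the separated outputs is arbitrary, the loss is evaluated under every output-to-reference assignment and only the cheapest one is back-propagated. Below is a minimal utterance-level PIT sketch in PyTorch; `pit_loss` is a hypothetical helper, and plain MSE stands in for the negative SI-SNR objective that Conv-TasNet-style models typically use.

```python
import itertools

import torch


def pit_loss(est, ref, loss_fn=torch.nn.functional.mse_loss):
    """Utterance-level PIT over (batch, n_src, time) tensors.

    Evaluates the per-utterance loss under every output<->reference
    permutation and keeps the best one (O(n_src!) permutations, which
    is cheap for the usual 2-3 speakers).
    """
    n_src = est.size(1)
    losses = []
    for perm in itertools.permutations(range(n_src)):
        # Loss of this assignment, averaged over time, kept per utterance.
        l = torch.stack([
            loss_fn(est[:, p], ref[:, t], reduction="none").mean(-1)
            for t, p in enumerate(perm)
        ]).mean(0)                            # shape: (batch,)
        losses.append(l)
    min_loss, _ = torch.stack(losses, dim=1).min(dim=1)  # best permutation
    return min_loss.mean()


# Typical training step: est = model(mix); pit_loss(est, ref).backward()
```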
- [Deep Audio-Visual Learning: A Survey, Hao Zhu, Arxiv 2020] [Paper]
- [Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks, Jen-Cheng Hou, TETCI 2017] [Paper] [Code]
- [The Sound of Pixels, Hang Zhao, ECCV 2018] [Paper/Demo]
- [Learning to Separate Object Sounds by Watching Unlabeled Video, Ruohan Gao, ECCV 2018] [Paper]
- [The Conversation: Deep Audio-Visual Speech Enhancement, Triantafyllos Afouras, Interspeech 2018] [Paper]
- [End-to-end audiovisual speech recognition, Stavros Petridis, ICASSP 2018] [Paper] [Code]
- [Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation, Ariel Ephrat, ACM Transactions on Graphics 2018] [Paper] [Code]
- [Time domain audio visual speech separation, Jian Wu, Arxiv 2019] [Paper] (a schematic audio-visual fusion sketch follows this list)
- [Recursive Visual Sound Separation Using Minus-Plus Net, Xudong Xu, ICCV 2019] [Paper]
- [The Sound of Motions, Hang Zhao, ICCV 2019] [Paper]
- [Audio-Visual Speech Separation and Dereverberation with a Two-Stage Multimodal Network, Ke Tan, Arxiv 2019] [Paper]
- [Co-Separating Sounds of Visual Objects, Ruohan Gao, ICCV 2019] [Paper] [Code]
- [Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments, Giovanni Morrone, Arxiv 2019] [Paper] [Code]
- [Music Gesture for Visual Sound Separation, Chuang Gan, CVPR 2020] [Paper]
- [FaceFilter: Audio-visual speech separation using still images, Soo-Whan Chung, Arxiv 2020] [Paper]
- [Awesome Audio-Visual, Github, Kranti Kumar Parida] [Github Link]
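Most of the audio-visual systems above share one schematic step: per-frame visual embeddings (lips or face) are upsampled to the audio feature rate and fused with the audio stream, typically by concatenation, before mask estimation. Here is a sketch of that fusion step in PyTorch; the module name `NaiveAVFusion` and all dimensions are made-up placeholders, and concatenation-then-projection is just one common choice (used e.g. in Looking to Listen).

```python
import torch
import torch.nn as nn


class NaiveAVFusion(nn.Module):
    """Fuse a (slower) visual embedding stream with audio features."""

    def __init__(self, audio_dim=256, visual_dim=512, hidden=256):
        super().__init__()
        self.proj = nn.Linear(audio_dim + visual_dim, hidden)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, T_audio, audio_dim)   e.g. STFT frames
        # visual_feats: (batch, T_video, visual_dim)  e.g. 25 fps lip embeddings
        # Upsample the video stream to the audio frame rate.
        v = torch.nn.functional.interpolate(
            visual_feats.transpose(1, 2),            # (batch, D, T_video)
            size=audio_feats.size(1), mode="nearest",
        ).transpose(1, 2)                            # (batch, T_audio, D)
        # Concatenate per frame and project back down.
        return self.proj(torch.cat([audio_feats, v], dim=-1))


# e.g. fused = NaiveAVFusion()(torch.randn(2, 100, 256), torch.randn(2, 25, 512))
```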
- [FaSNet: Low-latency Adaptive Beamforming for Multi-microphone Audio Processing, Yi Luo, Arxiv 2019] [Paper]
- [MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition, Xuankai Chang et al., ASRU 2019] [Paper]
- [End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation, Yi Luo et al., ICASSP 2020] [Paper] [Code]
- [Enhancing End-to-End Multi-channel Speech Separation via Spatial Feature Learning, Rongzhi Gu, ICASSP 2020] [Paper] (an IPD feature sketch follows this list)
- [Multi-modal Multi-channel Target Speech Separation, Rongzhi Gu, JSTSP 2020] [Paper]
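Besides learned beamforming, several of the multi-channel works above (e.g. the spatial feature learning paper) feed the network hand-crafted spatial cues such as inter-channel phase differences (IPDs), usually encoded as cos/sin pairs so the features do not wrap at ±π. A minimal numpy sketch; the `ipd_features` helper and the microphone-pair selection are placeholders, not any paper's exact front-end.

```python
import numpy as np


def ipd_features(specs, mic_pairs=((0, 1),)):
    """Cos/sin inter-channel phase differences.

    specs: complex STFTs of shape (n_mics, frames, freq_bins).
    Returns real features of shape (2 * len(mic_pairs), frames, freq_bins).
    """
    feats = []
    for i, j in mic_pairs:
        ipd = np.angle(specs[i]) - np.angle(specs[j])
        feats += [np.cos(ipd), np.sin(ipd)]  # wrap-free representation
    return np.stack(feats)


# e.g. with a 2-mic array: ipd = ipd_features(stft_all_mics)  # (2, T, F)
```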
- [Single channel target speaker extraction and recognition with speaker beam, Marc Delcroix, ICASSP 2018] [Paper]
- [VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking, Quan Wang, Interspeech 2019] [Paper] [Code (Jian Wu)] (a speaker-conditioning sketch follows this list)
- [Single-Channel Speech Extraction Using Speaker Inventory and Attention Network, Xiong Xiao et al., ICASSP 2019] [Paper]
- [Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss, Chenglin Xu, ICASSP 2019] [Paper] [Code]
- [Time-domain speaker extraction network, Chenglin Xu, ASRU 2019] [Paper]
- [SpEx: Multi-Scale Time Domain Speaker Extraction Network, Chenglin Xu, TASLP 2020] [Paper]
- [Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam, Marc Delcroix, ICASSP 2020] [Paper]
- [SpEx+: A Complete Time Domain Speaker Extraction Network, Meng Ge, Arxiv 2020] [Paper] [Code]
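All of the extraction systems above condition the separator on the target speaker. In the simplest variant, VoiceFilter-style, a fixed enrollment embedding (d-vector) is broadcast over time and concatenated with the mixture spectrogram before mask estimation. A schematic PyTorch sketch; `TinyVoiceFilter` and its layer sizes are hypothetical stand-ins for the CNN/LSTM stacks used in the actual papers.

```python
import torch
import torch.nn as nn


class TinyVoiceFilter(nn.Module):
    """Mask estimator conditioned on a target-speaker embedding (sketch)."""

    def __init__(self, n_freq=257, emb_dim=256, hidden=400):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + emb_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, spk_emb):
        # mix_mag: (batch, T, n_freq) mixture magnitude spectrogram
        # spk_emb: (batch, emb_dim) embedding from an enrollment utterance
        e = spk_emb.unsqueeze(1).expand(-1, mix_mag.size(1), -1)
        h, _ = self.rnn(torch.cat([mix_mag, e], dim=-1))
        return self.mask(h) * mix_mag  # masked magnitude of the target


# e.g. est = TinyVoiceFilter()(torch.rand(2, 100, 257), torch.randn(2, 256))
```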
- [Asteroid: the PyTorch-based audio source separation toolkit for researchers, Manuel Pariente et al., Interspeech 2020] [Tool Link]
- [Performance measurement in blind audio source separation, Emmanuel Vincent et al., TASLP 2006] [Paper] [Tool Link]
- [SDR – Half-baked or Well Done?, Jonathan Le Roux, ICASSP 2019] [Paper] [Tool Link]
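Most recent results, including the table below, are reported as SI-SDR improvement (SI-SDRi), the scale-invariant variant of SDR advocated in "SDR – Half-baked or Well Done?". The function below is a direct numpy transcription of that definition; for numbers comparable with older BSS Eval SDR figures, use the toolkits linked above.

```python
import numpy as np


def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB for single-channel 1-D signals.

    Both signals are mean-removed, the reference is scaled by the
    least-squares projection of the estimate onto it, and SI-SDR is
    the energy ratio of that target component to the residual.
    """
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(noise, noise) + eps))


# SI-SDRi as tabulated below = si_sdr(estimate, source) - si_sdr(mixture, source)
```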
Results of speech separation (SS) and speaker extraction (SE) methods on the WSJ0-2mix (8 kHz, min mode) dataset; "-" means the paper does not report that metric.
Task | Method | Model Size | SDRi (dB) | SI-SDRi (dB)
---|---|---|---|---
SS | DPCL++ | 13.6M | - | 10.8 |
SS | uPIT-BLSTM-ST | 92.7M | 10.0 | - |
SS | DANet | 9.1M | - | 10.5 |
SS | cuPIT-Grid-RD | 53.2M | 10.2 | - |
SS | SDC-G-MTL | 53.9M | 10.5 | - |
SS | CBLDNN-GAT | 39.5M | 11.0 | - |
SS | Chimera++ | 32.9M | 12.0 | 11.5 |
SS | WA-MISI-5 | 32.9M | 13.1 | 12.6 |
SS | BLSTM-TasNet | 23.6M | 13.6 | 13.2 |
SS | Conv-TasNet | 5.1M | 15.6 | 15.3 |
SE | SpEx | 10.8M | 17.0 | 16.6 |
SE | SpEx+ | 11.1M | 17.6 | 17.4 |
SS | DeepCASA | 12.8M | 18.0 | 17.7 |
SS | FurcaNeXt | 51.4M | 18.4 | - |
SS | DPRNN-TasNet | 2.6M | 19.0 | 18.8 |
SS | Wavesplit | - | 19.2 | 19.0 |
SS | Wavesplit + Dynamic mixing | - | 20.6 | 20.4 |