https://drive.google.com/file/d/1O2SVVKvDBzkJIjRuwpEkwVDCCpO821XE/view?usp=drive_link
This task aims to separate three fixed types of sound sources, namely speech, music, and background noise, from a monaural mixture into three tracks. Specifically, the speech track contains a normal speaking voice, while the music track is defined broadly and may contain full songs, vocals, or various accompaniments. Apart from music and speech, any other background sounds, such as closing doors, animal calls, and other annoying noises, are classified as the noise track.
In this project, we prepare three types of source datasets for generating the mixed dataset. We also release a Python script to preprocess these datasets and generate the MTASS dataset. You can download the required source datasets from the links below.
Aishell-1 and Didi Speech are used to build the speech source dataset of MTASS.
[Aishell-1](http://www.openslr.org/33/)
Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, 2017, pp. 1–5.
[Didi Speech](https://outreach.didichuxing.com/research/opendata/)
Tingwei Guo, Cheng Wen, Dongwei Jiang, Ne Luo, Ruixiong Zhang, Shuaijiang Zhao, Wubo Li, Cheng Gong, Wei Zou, Kun Han, et al., “Didispeech: A large scale mandarin speech corpus,” in ICASSP, 2021, pp. 6968–6972.
The demixing secrets dataset (DSD100) of the Signal Separation Evaluation Campaign (SISEC) is used as the music source dataset of MTASS.
[DSD100](https://sisec.inria.fr/)
N. Ono, Z. Rafii, D. Kitamura, N. Ito, and A. Liutkus, “The 2015 Signal Separation Evaluation Campaign,” in International Conference on Latent Variable Analysis and Signal Separation, 2015, pp. 186–190.
The noise dataset of the Deep Noise Suppression (DNS) Challenge is used as the noise source dataset of MTASS.
[DNS-noise](https://github.com/microsoft/DNS-Challenge/)
Chandan K. A. Reddy, Harishchandra Dubey, Vishak Gopal, et al., “ICASSP 2021 deep noise suppression challenge,” in ICASSP, 2021, pp. 6623–6627.
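Once the source datasets are available, the mixing step can be sketched roughly as follows. This is an illustrative assumption, not the official `dataset_generation` recipe: the function name, default SNR values, and RMS-based scaling scheme are all placeholders, and the released script defines its own mixing ranges.

```python
import numpy as np

def mix_tracks(speech, music, noise, music_snr_db=0.0, noise_snr_db=5.0):
    """Scale music and noise relative to the speech energy, then sum.

    The SNR defaults and scaling scheme are illustrative guesses; the
    official dataset_generation scripts define their own recipe.
    """
    def rms(x):
        return np.sqrt(np.mean(x ** 2) + 1e-12)

    # Scale each interfering track so that the speech-to-track energy
    # ratio matches the requested SNR (in dB).
    music = music * rms(speech) / (rms(music) * 10 ** (music_snr_db / 20))
    noise = noise * rms(speech) / (rms(noise) * 10 ** (noise_snr_db / 20))
    mixture = speech + music + noise
    # Return the mixture plus the three scaled reference tracks
    # (the separation targets).
    return mixture, speech, music, noise

# One-second toy signals at 16 kHz standing in for real recordings.
rng = np.random.default_rng(0)
mixture, s_ref, m_ref, n_ref = mix_tracks(rng.standard_normal(16000),
                                          rng.standard_normal(16000),
                                          rng.standard_normal(16000))
```

The scaled tracks are returned alongside the mixture because the separation model is trained against the scaled sources, not the originals.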
To tackle this challenging multi-task separation problem, we also propose a baseline model to separate the different tracks. Since this model works in the complex frequency domain for multi-task audio source separation, we call it “Complex-MTASSNet”. Complex-MTASSNet separates the signal of each audio track in the complex domain, and further compensates each track with the residual signal leaked into the other tracks. The framework of this baseline model is shown in Fig. 1. For more details of our model, please refer to our paper.
Fig 1. The framework of the proposed Complex-MTASSNet
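As a rough illustration of what separating in the complex domain means, the sketch below applies a generic complex ratio mask to the mixture spectrogram, followed by an optional residual-compensation step. The masks and residual here are placeholders, not the actual Complex-MTASSNet layers; in the real model they are predicted by the separation and compensation networks.

```python
import numpy as np

def separate_track(X, mask_real, mask_imag, residual=None):
    """Toy complex-domain separation step.

    X is the complex STFT of the mixture. mask_real/mask_imag would be
    predicted by the separation network, and residual by the residual
    compensation module; all three are placeholders here.
    """
    est = (mask_real + 1j * mask_imag) * X  # complex ratio masking
    if residual is not None:
        est = est + residual  # add back signal leaked into other tracks
    return est

# Toy mixture spectrogram: 257 frequency bins x 100 frames.
rng = np.random.default_rng(1)
X = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))

# An identity mask (real part 1, imaginary part 0) passes the mixture
# through unchanged, which is a quick sanity check on the masking step.
est = separate_track(X, np.ones(X.shape), np.zeros(X.shape))
```

Because the mask has both real and imaginary parts, it can modify the phase of the mixture as well as its magnitude, which is the motivation for working in the complex domain rather than on magnitude spectrograms alone.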
In this multi-task separation, we have compared the proposed Complex-MTASSNet with several well-known baselines from speech enhancement, speech separation, and music source separation: GCRN, Conv-TasNet, Demucs, and D3Net.
| Methods | Para. (M) | MACs/s | Speech SDRi (dB) | Music SDRi (dB) | Noise SDRi (dB) | Ave. SDRi (dB) |
|---|---|---|---|---|---|---|
| GCRN | 9.88 | 2.5 G | 9.11 | 5.76 | 5.51 | 6.79 |
| Demucs | 243.32 | 5.6 G | 9.93 | 6.38 | 6.29 | 7.53 |
| D3Net | 7.93 | 3.5 G | 10.55 | 7.64 | 7.79 | 8.66 |
| Conv-TasNet | 5.14 | 5.2 G | 11.80 | 8.35 | 8.07 | 9.41 |
| Complex-MTASSNet | 28.18 | 1.8 G | 12.57 | 9.86 | 8.42 | 10.28 |
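SDRi in the table is the improvement in signal-to-distortion ratio that a model achieves over using the unprocessed mixture as the estimate. A minimal sketch of such a score follows; it uses the plain energy-ratio SDR definition, which may differ from the exact BSS-Eval variant used for the reported numbers.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    # Plain signal-to-distortion ratio in dB (not the full BSS-Eval SDR).
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10 * np.log10(num / den + eps)

def sdr_improvement(reference, estimate, mixture):
    # SDRi: gain of the model's estimate over the unprocessed mixture.
    return sdr(reference, estimate) - sdr(reference, mixture)

# Toy check: shrinking the interference tenfold should buy about 20 dB.
rng = np.random.default_rng(2)
ref = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
interference = rng.standard_normal(8000)
mixture = ref + 0.5 * interference
estimate = ref + 0.05 * interference
gain = sdr_improvement(ref, estimate, mixture)
```

Reporting the improvement rather than the raw SDR makes scores comparable across tracks whose mixtures start at very different distortion levels.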
Audio demos (embedded players removed): for each of the three tracks, the ideal Speech / Music / Noise source together with the corresponding separated outputs of GCRN, Demucs, D3Net, Conv-TasNet, and Complex-MTASSNet.
The MTASS project contains two parts: a model construction part and a dataset generation part.
After downloading the original datasets, follow the instructions in /dataset_generation/readme.txt to generate your own MTASS dataset.
Alternatively, you can directly download our prepared datasets:
Link: https://pan.baidu.com/s/1FN2nIWyfAEmlJnX5_HYekQ (extraction code: d22v)
--run.py:
The main entry point of the whole project; it drives feature extraction, model training, and model testing.
--DNN_models:
Contains the model file and its solver file, covering the model definition, feature extraction, model training, and testing.
--utils:
Contains a utils library file with many audio data processing functions that operate on the numpy data format.
--train_data:
This folder will be created to store the extracted training features and labels.
--dev_data:
This folder will be created to store the extracted development features and labels.
--model_parameters:
This folder will be created to store the saved model files (.pth).
If you use this code for your research, please consider citing:
@inproceedings{zhang2021multi,
  title={Multi-Task Audio Source Separation},
  author={Zhang, Lu and Li, Chenxing and Deng, Feng and Wang, Xiaorui},
  booktitle={2021 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
  year={2021},
  pages={671--678},
  organization={IEEE}
}
All rights reserved by Lu Zhang (zhanglu_wind@163.com).
If you have any questions, please contact Lu Zhang (zhanglu_wind@163.com) and Chenxing Li (lichenxing007@gmail.com).
The source code here may be used only for non-commercial research and educational purposes.
2022-01-25
We will open-source the training code and pre-trained models in the next two weeks. If you would like to get the pre-trained models earlier, or for more details, please contact us directly.