https://drive.google.com/file/d/1O2SVVKvDBzkJIjRuwpEkwVDCCpO821XE/view?usp=drive_link
This task aims to separate three fixed types of sound sources, namely speech, music, and background noise, from a monaural mixture into three tracks. Specifically, the speech track contains a normal speaking voice, while the music track is defined broadly and may contain full songs, vocals, or various accompaniments. Apart from music and speech, any other background sounds, such as closing doors, animal calls, and other annoying noises, are classified as the noise track.
In this project, we prepare three types of source datasets for generating the mixed dataset. We also release a Python script to preprocess these datasets and generate the MTASS dataset. You can download the required source datasets from the links below.
Aishell-1 and Didi Speech are used to build the speech source dataset of MTASS.
[Aishell-1](http://www.openslr.org/33/)
Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, 2017, pp. 1–5.
[Didi Speech](https://outreach.didichuxing.com/research/opendata/)
Tingwei Guo, Cheng Wen, Dongwei Jiang, Ne Luo, Ruixiong Zhang, Shuaijiang Zhao, Wubo Li, Cheng Gong, Wei Zou, Kun Han, et al., “Didispeech: A large scale mandarin speech corpus,” in ICASSP, 2021, pp. 6968–6972.
The demixing secrets dataset (DSD100) of the Signal Separation Evaluation Campaign (SISEC) is used as the music source dataset of MTASS.
[DSD100](https://sisec.inria.fr/)
N. Ono, Z. Rafii, D. Kitamura, N. Ito, and A. Liutkus, “The 2015 Signal Separation Evaluation Campaign,” in International Conference on Latent Variable Analysis and Signal Separation, 2015, pp. 186–190.
The noise dataset of the Deep Noise Suppression (DNS) Challenge is used as the noise source dataset of MTASS.
[DNS-noise](https://github.com/microsoft/DNS-Challenge/)
Chandan K. A. Reddy, Harishchandra Dubey, Vishak Gopal, et al., “ICASSP 2021 deep noise suppression challenge,” in ICASSP, 2021, pp. 6623–6627.
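Once the source datasets are available, the mixing step can be sketched roughly as follows. This is an illustrative assumption, not the official `dataset_generation` recipe: the function name, default SNR values, and RMS-based scaling scheme are all placeholders, and the released script defines its own mixing ranges.

```python
import numpy as np

def mix_tracks(speech, music, noise, music_snr_db=0.0, noise_snr_db=5.0):
    """Scale music and noise relative to the speech energy, then sum.

    The SNR defaults and scaling scheme are illustrative guesses; the
    official dataset_generation scripts define their own recipe.
    """
    def rms(x):
        return np.sqrt(np.mean(x ** 2) + 1e-12)

    # Scale each interfering track so that the speech-to-track energy
    # ratio matches the requested SNR (in dB).
    music = music * rms(speech) / (rms(music) * 10 ** (music_snr_db / 20))
    noise = noise * rms(speech) / (rms(noise) * 10 ** (noise_snr_db / 20))
    mixture = speech + music + noise
    # Return the mixture plus the three scaled reference tracks
    # (the separation targets).
    return mixture, speech, music, noise

# One-second toy signals at 16 kHz standing in for real recordings.
rng = np.random.default_rng(0)
mixture, s_ref, m_ref, n_ref = mix_tracks(rng.standard_normal(16000),
                                          rng.standard_normal(16000),
                                          rng.standard_normal(16000))
```

The scaled tracks are returned alongside the mixture because the separation model is trained against the scaled sources, not the originals.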
To tackle this challenging multi-task separation problem, we also propose a baseline model to separate the different tracks. Since this model works in the complex frequency domain for multi-task audio source separation, we call it “Complex-MTASSNet”. Complex-MTASSNet separates the signal of each audio track in the complex domain, and further compensates each track with the residual signal leaked into the other tracks. The framework of this baseline model is shown in Fig. 1. For more details of our model, please refer to our paper.
Fig 1. The framework of the proposed Complex-MTASSNet
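As a rough illustration of what separating in the complex domain means, the sketch below applies a generic complex ratio mask to the mixture spectrogram, followed by an optional residual-compensation step. The masks and residual here are placeholders, not the actual Complex-MTASSNet layers; in the real model they are predicted by the separation and compensation networks.

```python
import numpy as np

def separate_track(X, mask_real, mask_imag, residual=None):
    """Toy complex-domain separation step.

    X is the complex STFT of the mixture. mask_real/mask_imag would be
    predicted by the separation network, and residual by the residual
    compensation module; all three are placeholders here.
    """
    est = (mask_real + 1j * mask_imag) * X  # complex ratio masking
    if residual is not None:
        est = est + residual  # add back signal leaked into other tracks
    return est

# Toy mixture spectrogram: 257 frequency bins x 100 frames.
rng = np.random.default_rng(1)
X = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))

# An identity mask (real part 1, imaginary part 0) passes the mixture
# through unchanged, which is a quick sanity check on the masking step.
est = separate_track(X, np.ones(X.shape), np.zeros(X.shape))
```

Because the mask has both real and imaginary parts, it can modify the phase of the mixture as well as its magnitude, which is the motivation for working in the complex domain rather than on magnitude spectrograms alone.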
In this multi-task separation, we have compared the proposed Complex-MTASSNet with several well-known baselines from speech enhancement, speech separation, and music source separation: GCRN, Conv-TasNet, Demucs, and D3Net.
| Methods | Para. (M) | MACs/s | Speech SDRi (dB) | Music SDRi (dB) | Noise SDRi (dB) | Ave. SDRi (dB) |
|---|---|---|---|---|---|---|
| GCRN | 9.88 | 2.5 G | 9.11 | 5.76 | 5.51 | 6.79 |
| Demucs | 243.32 | 5.6 G | 9.93 | 6.38 | 6.29 | 7.53 |
| D3Net | 7.93 | 3.5 G | 10.55 | 7.64 | 7.79 | 8.66 |
| Conv-TasNet | 5.14 | 5.2 G | 11.80 | 8.35 | 8.07 | 9.41 |
| Complex-MTASSNet | 28.18 | 1.8 G | 12.57 | 9.86 | 8.42 | 10.28 |
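SDRi in the table is the improvement in signal-to-distortion ratio that a model achieves over using the unprocessed mixture as the estimate. A minimal sketch of such a score follows; it uses the plain energy-ratio SDR definition, which may differ from the exact BSS-Eval variant used for the reported numbers.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    # Plain signal-to-distortion ratio in dB (not the full BSS-Eval SDR).
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10 * np.log10(num / den + eps)

def sdr_improvement(reference, estimate, mixture):
    # SDRi: gain of the model's estimate over the unprocessed mixture.
    return sdr(reference, estimate) - sdr(reference, mixture)

# Toy check: shrinking the interference tenfold should buy about 20 dB.
rng = np.random.default_rng(2)
ref = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
interference = rng.standard_normal(8000)
mixture = ref + 0.5 * interference
estimate = ref + 0.05 * interference
gain = sdr_improvement(ref, estimate, mixture)
```

Reporting the improvement rather than the raw SDR makes scores comparable across tracks whose mixtures start at very different distortion levels.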
Audio demos (embedded players removed): for each of the three tracks, the ideal Speech / Music / Noise source together with the corresponding separated outputs of GCRN, Demucs, D3Net, Conv-TasNet, and Complex-MTASSNet.
The MTASS project contains two parts: a model construction part and a dataset generation part.
After downloading the original datasets, follow the instructions in /dataset_generation/readme.txt to generate your own MTASS dataset.
Alternatively, you can directly download our prepared datasets:
Link: https://pan.baidu.com/s/1FN2nIWyfAEmlJnX5_HYekQ (extraction code: d22v)
--run.py:
The main entry point of the whole project; it drives feature extraction, model training, and model testing.
--DNN_models:
Contains the model file and its solver file, covering the model definition, feature extraction, model training, and testing.
--utils:
Contains a utils library file with many audio data processing functions that operate on the numpy data format.
--train_data:
This folder will be created to store the extracted training features and labels.
--dev_data:
This folder will be created to store the extracted development features and labels.
--model_parameters:
This folder will be created to store the saved model files (.pth).
If you use this code for your research, please consider citing:
@inproceedings{zhang2021multi,
  title={Multi-Task Audio Source Separation},
  author={Zhang, Lu and Li, Chenxing and Deng, Feng and Wang, Xiaorui},
  booktitle={2021 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
  year={2021},
  pages={671--678},
  organization={IEEE}
}
All rights reserved by Lu Zhang (zhanglu_wind@163.com).
If you have any questions, please contact Lu Zhang (zhanglu_wind@163.com) and Chenxing Li (lichenxing007@gmail.com).
The source code here may be used only for non-commercial research and educational purposes.
2022-01-25
We will open-source the training code and pre-trained models in the next two weeks. If you would like to get the pre-trained models earlier, or for more details, please contact us directly.