
Open-Unmix for PyTorch

This repository contains the PyTorch (1.0+) implementation of Open-Unmix, a deep neural network reference implementation for music source separation, applicable for researchers, audio engineers and artists. Open-Unmix provides ready-to-use models that allow users to separate pop music into four stems: vocals, drums, bass and the remaining other instruments. The models were pre-trained on the MUSDB18 dataset. See the section on applying the pre-trained models below for details.

We also provide implementations for TensorFlow and NNabla.

The Model

Open-Unmix is based on a three-layer bidirectional deep LSTM. The model learns to predict the magnitude spectrogram of a target, like vocals, from the magnitude spectrogram of a mixture input. Internally, the prediction is obtained by applying a mask to the input. The model is optimized in the magnitude domain using mean squared error, and the actual separation is done in a post-processing step involving a multichannel Wiener filter implemented using norbert. To perform separation into multiple sources, a separate model is trained for each particular target. While this makes training less convenient, it allows great flexibility to customize the training data for each target source.

Input Stage

Open-Unmix operates in the time-frequency domain to perform its prediction. The input of the model is either:

  • A time domain signal tensor of shape (nb_samples, nb_channels, nb_timesteps), where nb_samples are the samples in a batch, nb_channels is 1 or 2 for mono or stereo audio, respectively, and nb_timesteps is the number of audio samples in the recording.

In that case, the model computes spectrograms with torch.stft on the fly (a sketch follows at the end of this section).

  • Alternatively, open-unmix also takes magnitude spectrograms directly (e.g. when pre-computed and loaded from disk).

In that case, the input is of shape (nb_frames, nb_samples, nb_channels, nb_bins), where nb_frames and nb_bins are the time and frequency dimensions of a Short-Time Fourier Transform.

The input spectrogram is standardized using the global mean and standard deviation for every frequency bin across all frames. Furthermore, we apply batch normalization in multiple stages of the model to make the training more robust against gain variation.
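
The on-the-fly spectrogram computation mentioned above can be sketched as follows. This is a minimal illustration, not the repository's code: the n_fft and hop-length values are typical choices rather than confirmed defaults, and return_complex=True requires a more recent PyTorch than the 1.0 minimum stated earlier.

```python
import torch

# illustrative STFT parameters (assumed, not the repository's confirmed defaults)
n_fft, hop = 4096, 1024

# dummy stereo batch: (nb_samples, nb_channels, nb_timesteps)
audio = torch.randn(2, 2, 44100 * 5)
nb_samples, nb_channels, nb_timesteps = audio.shape

window = torch.hann_window(n_fft)

# torch.stft works on (batch, time), so fold the channel axis into the batch axis
stft = torch.stft(
    audio.reshape(nb_samples * nb_channels, nb_timesteps),
    n_fft=n_fft,
    hop_length=hop,
    window=window,
    center=True,
    return_complex=True,   # requires a recent PyTorch; older versions return a real tensor
)
magnitude = stft.abs()     # (nb_samples * nb_channels, nb_bins, nb_frames)

# rearrange to the spectrogram input shape described above
nb_bins, nb_frames = magnitude.shape[-2:]
spec = (
    magnitude.reshape(nb_samples, nb_channels, nb_bins, nb_frames)
    .permute(3, 0, 1, 2)
)
print(spec.shape)  # (nb_frames, nb_samples, nb_channels, nb_bins) = torch.Size([216, 2, 2, 2049])
```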

Dimensionality reduction

The LSTM does not operate on the original input spectrogram resolution. Instead, in the first step after the normalization, the network learns to compress the frequency and channel axes of the input to reduce redundancy and make the model converge faster.
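
A minimal sketch of such a compression stage is below. The hidden size, the use of a fully connected layer followed by batch normalization and a tanh, and the exact tensor reshaping are illustrative assumptions, not necessarily the repository's exact layers.

```python
import torch
import torch.nn as nn

nb_channels, nb_bins, hidden_size = 2, 2049, 512  # hidden_size is an assumed value

# collapse the channel and frequency axes into one vector per frame, then compress it
fc1 = nn.Linear(nb_channels * nb_bins, hidden_size, bias=False)
bn1 = nn.BatchNorm1d(hidden_size)

# x: standardized magnitude spectrogram, (nb_frames, nb_samples, nb_channels, nb_bins)
x = torch.rand(100, 4, nb_channels, nb_bins)
nb_frames, nb_samples = x.shape[:2]

y = fc1(x.reshape(nb_frames * nb_samples, nb_channels * nb_bins))
y = torch.tanh(bn1(y))                            # batch norm + squashing before the LSTM
y = y.reshape(nb_frames, nb_samples, hidden_size)
print(y.shape)  # torch.Size([100, 4, 512])
```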

Bidirectional-LSTM

The core of open-unmix is a three-layer bidirectional LSTM network. Due to its recurrent nature, the model can be trained and evaluated on audio signals of arbitrary length. Since the model takes information from past and future simultaneously, it cannot be used in an online/real-time manner. A uni-directional model can easily be trained as described here.
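
As a rough sketch (layer sizes are illustrative), the recurrent core can be written with nn.LSTM; setting bidirectional=False would give the uni-directional variant suitable for online processing.

```python
import torch
import torch.nn as nn

hidden_size = 512  # assumed to match the compressed representation above

# three stacked LSTM layers; bidirectional=True lets every frame see past and future
# context, which is why this variant cannot run in an online/real-time manner
lstm = nn.LSTM(
    input_size=hidden_size,
    hidden_size=hidden_size // 2,  # halved so both directions concatenate back to hidden_size
    num_layers=3,
    bidirectional=True,
)

y = torch.rand(100, 4, hidden_size)    # (nb_frames, nb_samples, hidden_size)
lstm_out, _ = lstm(y)                  # (nb_frames, nb_samples, hidden_size)

# a skip connection concatenating the LSTM input and output is a common choice here
z = torch.cat([y, lstm_out], dim=-1)   # (nb_frames, nb_samples, hidden_size * 2)
```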

Output Stage

After applying the LSTM, the signal is decoded back to its original input dimensionality. In the last step, the output is multiplied with the input magnitude spectrogram, so that the model is asked to learn a mask.
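
A simplified sketch of that decoding step, assuming illustrative layer sizes: fully connected layers map the recurrent features back to (nb_channels × nb_bins), and the (non-negative) result is multiplied element-wise with the mixture magnitude, so the network effectively learns a mask.

```python
import torch
import torch.nn as nn

nb_channels, nb_bins, hidden_size = 2, 2049, 512  # assumed sizes, matching the sketches above

decoder = nn.Sequential(
    nn.Linear(hidden_size * 2, hidden_size, bias=False),
    nn.BatchNorm1d(hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, nb_channels * nb_bins, bias=False),
    nn.BatchNorm1d(nb_channels * nb_bins),
)

nb_frames, nb_samples = 100, 4
mix_mag = torch.rand(nb_frames, nb_samples, nb_channels, nb_bins)  # input magnitude spectrogram
z = torch.rand(nb_frames * nb_samples, hidden_size * 2)            # features after the LSTM stage

# decode back to the input dimensionality and use the result as a (non-negative) mask
mask = decoder(z).reshape(nb_frames, nb_samples, nb_channels, nb_bins)
target_mag = torch.relu(mask) * mix_mag   # multiplied with the input magnitude spectrogram
print(target_mag.shape)  # torch.Size([100, 4, 2, 2049])
```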

Separation

Since PyTorch currently lacks an invertible STFT, the synthesis is performed in numpy. For inference, we rely on an implementation of a multichannel Wiener filter, a very popular way of filtering multichannel audio for several applications, notably speech enhancement and source separation. The norbert module assumes that non-negative power spectrogram estimates are available for all the audio sources composing a mixture.
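
A hedged sketch of this post-processing step with norbert is shown below; the argument shapes and keyword names follow norbert's wiener function but may differ between versions.

```python
import numpy as np
import norbert

nb_frames, nb_bins, nb_channels, nb_sources = 100, 2049, 2, 4

# v: non-negative magnitude/power estimates for every source, as produced by the target models
v = np.random.rand(nb_frames, nb_bins, nb_channels, nb_sources)

# x: complex STFT of the mixture
x = (np.random.randn(nb_frames, nb_bins, nb_channels)
     + 1j * np.random.randn(nb_frames, nb_bins, nb_channels))

# expectation-maximization refinement; `iterations` corresponds to the --niter
# command line option described under Separation Parameters
estimates = norbert.wiener(v, x, iterations=1, use_softmask=False)
print(estimates.shape)  # (nb_frames, nb_bins, nb_channels, nb_sources), one complex STFT per source
```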

Getting started

Installation

For installation, we recommend using the Anaconda Python distribution. To create a conda environment for open-unmix, simply run:

conda env create -f environment-X.yml

where X is either [cpu-linux, gpu-cuda10, cpu-osx], depending on your system. For now, we haven't tested Windows support.

Applying the pre-trained model on audio files

To separate audio files (wav, flac, ogg), just run:

python test.py input_file.wav

Additionally, --model umx can be used to load a different pre-trained model. We currently support the following:

  • umxhq (default) is trained on MUSDB18-HQ, which comprises the same tracks as MUSDB18 but uncompressed, yielding a full bandwidth of 22050 Hz.

  • umx is trained on the regular MUSDB18, which is bandlimited to 16 kHz due to AAC compression. This model should be used for comparison with other (older) methods evaluated in SiSEC18.

We provide a notebook on Google Colab to experiment with open-unmix and to separate files online without any installation.

Separation Parameters

The separation can be controlled with additional parameters that influence the performance of the separation:

| Command line Argument | Description | Default |
| --- | --- | --- |
| `--targets list(str)` | Targets to be used for separation. For each target, a model file with the same name is required. | `['vocals', 'drums', 'bass', 'other']` |
| `--softmask` | If activated, the initial estimates for the sources are obtained through a ratio mask of the mixture STFT, instead of the default behavior of reconstructing waveforms using the mixture phase. | not set |
| `--niter <int>` | Number of EM steps for refining the initial estimates in a post-processing stage. `--niter 0` skips this step altogether. More iterations can give better interference reduction at the price of more artifacts. | `1` |
| `--alpha <float>` | In case of soft masking, this value changes the exponent used for building ratio masks. A smaller value usually leads to more interference but better perceptual quality, whereas a larger value leads to less interference but an "overprocessed" sensation. | `1.0` |
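
For example, to combine soft masking with two refinement iterations (using only the options listed above), one could run:

python test.py --softmask --niter 2 input_file.wav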

Load user-trained models

python test.py --model /path/to/model/root/directory input_file.wav

Note that the model directory usually contains an individual model for each target, and separation is performed using all of them. E.g., if the model path contains vocals and drums models, two output files are generated.

Evaluation using museval

To perform evaluation in comparison to other SiSEC systems, you need to install the museval package using

pip install museval

and then run the evaluation using

python eval.py --outdir /path/to/musdb/estimates --evaldir /path/to/museval/results

Results compared to SiSEC 2018 (SDR/Vocals)

Open-Unmix yields state-of-the-art results compared to participants from SiSEC 2018. The performance of UMXHQ and UMX is almost identical since both were evaluated on the compressed STEMS.

[Figure: boxplot of vocals SDR scores, comparing Open-Unmix with SiSEC 2018 submissions]

Note that

  • [STL1, STL2, TAK2, TAK3, TAU1, UHL3] used additional training data, which is why we did not list them here.
  • [HEL1, TAK1, UHL1, UHL2] are not open-source.

Scores (Median of frames, Median of tracks)

| target | UMX SDR | UMX SIR | UMX SAR | UMX ISR | UMXHQ SDR | UMXHQ SIR | UMXHQ SAR | UMXHQ ISR |
| ------ | ------- | ------- | ------- | ------- | --------- | --------- | --------- | --------- |
| vocals | 6.32 | 13.33 | 6.52 | 11.93 | 6.25 | 12.95 | 6.50 | 12.70 |
| bass   | 5.23 | 10.93 | 6.34 | 9.23  | 5.07 | 10.35 | 6.02 | 9.71  |
| drums  | 5.73 | 11.12 | 6.02 | 10.51 | 6.04 | 11.65 | 5.93 | 11.17 |
| other  | 4.02 | 6.59  | 4.74 | 9.31  | 4.28 | 7.10  | 4.62 | 8.78  |

All values are in dB.

Training

Details on the training are provided in a separate document here.

Extensions

Details on how open-unmix can be extended or improved for future research on music separation are described in a separate document here.

Design Choices / Contributions

  • We favored simplicity over performance to promote clarity of the code. The rationale is to have open-unmix serve as a baseline for future research, while its performance still meets the current state of the art (see Evaluation). The results are comparable to or better than those of UHL1/UHL2, which obtained the best performance of all systems trained on MUSDB18 in the SiSEC 2018 evaluation campaign.
  • We designed the code to allow researchers to reproduce existing results, quickly develop new architectures, and add their own data for training and testing. We favored framework-specific implementations over a monolithic repository.
  • open-unmix is a community-focused project; we therefore encourage the community to submit bug fixes and comments and to improve the computational performance. However, we are not looking for changes that focus only on improving performance.

Authors

Fabian-Robert Stöter, Antoine Liutkus, Inria and LIRMM, Montpellier, France

References

If you use open-unmix for your research – Cite Open-Unmix
@article{stoter19,
  author  = {F.-R. St\"oter and S. Uhlich and A. Liutkus and Y. Mitsufuji},
  title   = {Open-unmix: a reference implementation for source separation},
  journal = {Journal of Open-Source Research},
  year    = 2019,
  note    = {submitted}
}

If you use the MUSDB dataset for your research - Cite the MUSDB18 Dataset

@misc{MUSDB18,
  author       = {Rafii, Zafar and
                  Liutkus, Antoine and
                  St{\"o}ter, Fabian-Robert and
                  Mimilakis, Stylianos Ioannis and
                  Bittner, Rachel},
  title        = {The {MUSDB18} corpus for music separation},
  month        = dec,
  year         = 2017,
  doi          = {10.5281/zenodo.1117372},
  url          = {https://doi.org/10.5281/zenodo.1117372}
}

If you compare your results with SiSEC 2018 participants - Cite the SiSEC 2018 LVA/ICA Paper

@inproceedings{SiSEC18,
  author="St{\"o}ter, Fabian-Robert and Liutkus, Antoine and Ito, Nobutaka",
  title="The 2018 Signal Separation Evaluation Campaign",
  booktitle="Latent Variable Analysis and Signal Separation:
  14th International Conference, LVA/ICA 2018, Surrey, UK",
  year="2018",
  pages="293--305"
}

⚠️ Please note that the official acronym for open-unmix is UMX.

License

MIT
