Pitch Controllable DDSP Vocoders

This repository is a collection of relatively high-fidelity, fast, easy-to-train, pitch-controllable DDSP vocoders, modified from the repositories below:

https://github.com/magenta/ddsp

https://github.com/YatingMusic/ddsp-singing-vocoders

1. Installing the dependencies

We recommend first installing PyTorch from the official website, and then running:

pip install -r requirements.txt 
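
After installation, you can sanity-check the environment with a short Python snippet (this only assumes PyTorch installed successfully; CUDA availability depends on the build you chose):

# check that PyTorch imports and whether a GPU is usable
import torch
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())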

2. Preprocessing

Put all training data (.wav format audio clips) in the directory data/train/audio, and all validation data (.wav format audio clips) in the directory data/val/audio. Then run

python preprocess.py -c configs/full.yaml

for a model of hybrid additive synthesis and subtractive synthesis, or run

python preprocess.py -c configs/sins.yaml

for a model of additive synthesis only, or run

python preprocess.py -c configs/sawsub.yaml

for a model of subtractive synthesis only.
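
For intuition about the three model families, here is a toy NumPy sketch of the underlying ideas. It is an illustration only, not this repository's code: additive synthesis ("Sins") sums harmonic sinusoids at multiples of f0, subtractive synthesis ("SawSub") filters a harmonically rich source such as a sawtooth, and the "Full" model combines both.

# toy additive vs. subtractive synthesis (illustration only, not this repo's code)
import numpy as np

sr = 44100                    # sampling rate (Hz)
f0 = 220.0                    # constant pitch for simplicity
t = np.arange(sr) / sr        # one second of sample times

# additive: sum of harmonics with decaying amplitudes
additive = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 20))

# subtractive: start from a sawtooth (rich in harmonics)...
saw = 2.0 * (f0 * t % 1.0) - 1.0
# ...then remove high frequencies with a crude moving-average low-pass filter
subtractive = np.convolve(saw, np.ones(32) / 32, mode="same")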

You can modify the configuration file configs/<model_name>.yaml before preprocessing. The default configuration targets 44.1 kHz sampling-rate audio, a training set of roughly a few hours, and a GTX 1660 graphics card.
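
If you are unsure which parameters the file exposes, you can dump it before editing; this assumes nothing beyond the file being standard YAML:

# print the configuration as a Python dict before editing it
import yaml
with open("configs/full.yaml") as f:
    print(yaml.safe_load(f))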

3. Training

# train a full model as an example
python train.py -c configs/full.yaml

The command line for training other models is similar.

You can safely interrupt training; running the same command line again will resume it.

You can also fine-tune the model: interrupt training, then re-preprocess with a new dataset or change the training parameters (batch size, learning rate, etc.), and run the same command line again.

4. Visualization

# check the training status using tensorboard
tensorboard --logdir=exp

5. Copy-synthesising or pitch-shifting test

# Copy-synthesising test
# wav -> mel, f0 -> wav
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)>
# Pitch-shifting test
# wav -> mel (unchanged), f0 (shifted) -> wav
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)>
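
The key change is given in semitones; under the standard equal-temperament convention, a shift of k semitones scales f0 by a factor of 2^(k/12):

# semitone key change -> f0 scaling factor (equal temperament)
def shift_f0(f0_hz, keychange):
    return f0_hz * 2.0 ** (keychange / 12.0)

print(shift_f0(440.0, 12))   # 880.0: one octave up
print(shift_f0(440.0, -12))  # 220.0: one octave down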

6. Some suggestions for model choice

It is recommended to try the "Full" model first, which generally has a low multi-scale STFT loss and relatively good quality when applying a pitch shift.

However, this loss does not always reflect subjective listening quality.
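
For reference, a multi-scale STFT loss compares magnitude spectrograms of output and target at several FFT resolutions. Below is a minimal PyTorch sketch of the general form; the actual loss in this repository may weight or combine the terms differently:

# minimal multi-scale STFT loss sketch (generic form, not this repo's exact loss)
import torch

def multi_scale_stft_loss(x, y, fft_sizes=(512, 1024, 2048)):
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=x.device)
        X = torch.stft(x, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True).abs()
        Y = torch.stft(y, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True).abs()
        # L1 on linear and log magnitudes, one term per resolution
        loss = loss + (X - Y).abs().mean()
        loss = loss + (torch.log(X + 1e-7) - torch.log(Y + 1e-7)).abs().mean()
    return loss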

If the "Full" model does not work well, it is recommended to switch to the "Sins" model.

The "Sins" model works also well when applying copy synthesis, but it changes the formant when applying a pitch shift, which changes the timbre.

The "SawSub" model is not recommended due to artifacts in unvoiced phonemes, although it probably has the best formant invariance in pitch-shifting cases.

7. Comments on the sound quality

For a seen speaker, the sound quality of a well-trained DDSP vocoder is better than that of the WORLD or Griffin-Lim vocoders, and it can also compete with GAN-based vocoders when the total amount of data is relatively small. With a large amount of data, however, its ceiling on sound quality is lower than that of generative-model-based vocoders.

For an unseen speaker, the performance may be unsatisfactory.

License: GNU Affero General Public License v3.0