Multi-voice singing voice synthesis

Performance bottleneck (not from model)

The first time I called '', I noticed the long preprocessing time needed to generate the hdf5 files: approximately 3 hours on my computer for the 96 hdf5 files. I traced this to a performance bottleneck in sp_to_mgc (an SPTK dependency).

To produce a 2m54 song (the Elton John one from the NUS database), my computer needs more than 13 minutes, which is 10 minutes more than the song duration. I first thought this was because I run the model on my CPU (not GPU), but I took some measurements and found that the problem is clearly not the model or the 'AI' part.

The inference call:

import models
import config

file_name = 'nus_JLEE_sing_15.hdf5'

singer_name = 'MPOL'
singer_index = config.singers.index(singer_name)

model = models.WGANSing()
model.test_file_hdf5_no_question(file_name, singer_index)

The test_file_hdf5_no_question method is the same as test_file_hdf5 without the questions, but with per-function timing measurements, and it generates only the synthesized audio (not the ground truth).
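For anyone who wants to reproduce this kind of per-function measurement, here is a minimal sketch of one way to do it. The names `timed`, `timings` and `report` are hypothetical helpers written for this illustration, not part of WGANSing, and the two timed bodies are stand-ins for the real calls.

```python
import time
from contextlib import contextmanager

timings = {}  # step name -> accumulated elapsed seconds


@contextmanager
def timed(name, store=timings):
    """Measure the wall-clock time of the enclosed block and store it under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[name] = store.get(name, 0.0) + (time.perf_counter() - start)


def report(store=timings):
    """Return one formatted line per measured step, in insertion order."""
    width = max(len(name) for name in store)
    return [f"- {name:<{width}} : {seconds:14.10f}" for name, seconds in store.items()]


# Stand-in workloads; in practice each block would wrap one real call.
with timed("read_hdf5_file"):
    sum(range(1000))
with timed("feats_to_audio"):
    time.sleep(0.01)

for line in report():
    print(line)
```

Wrapping each stage in its own `with timed(...)` block keeps the measurement code out of the functions being measured.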

The timing results (in seconds):

- load_model [*]   :   2.7976150512695
- read_hdf5_file   :   0.0341496467590
- process_file [*] :   3.0663671493530
- feats_to_audio   : 770.0193181037903

[*] TensorFlow calls

Clearly, the AI part is very fast, even on CPU. The problem comes from the audio regeneration.

Details of the feats_to_audio calls (again in seconds):

- f0_to_hertz   :   0.0130412578582
- mfsc_to_mgc   :   0.7175555229187
- mgc_to_sp     : 737.2016060352325
- pw.synthesize :  25.4196729660034
- sf.write      :   0.7051553726196

The PyWorld synthesize call is acceptable at 25 seconds (about 14% of the audio duration), but the SPTK call (mgc_to_sp, 737 seconds) is not.
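The proportions can be checked with a few lines of arithmetic. This sketch only uses the numbers reported above; the variable names are made up for the illustration.

```python
# Durations in seconds, copied from the measurements above.
song_seconds = 2 * 60 + 54  # the 2m54 song
feats_to_audio_total = 770.0193181037903

steps = {
    "f0_to_hertz": 0.0130412578582,
    "mfsc_to_mgc": 0.7175555229187,
    "mgc_to_sp": 737.2016060352325,
    "pw.synthesize": 25.4196729660034,
    "sf.write": 0.7051553726196,
}

# Share of feats_to_audio spent in each step, in percent.
shares = {name: 100.0 * seconds / feats_to_audio_total for name, seconds in steps.items()}

# PyWorld synthesis time relative to the song duration, in percent.
synth_vs_song = 100.0 * steps["pw.synthesize"] / song_seconds

# Real-time factor of audio regeneration: processing time / audio duration.
real_time_factor = feats_to_audio_total / song_seconds

for name, share in shares.items():
    print(f"{name:<14} {share:6.2f} % of feats_to_audio")
print(f"pw.synthesize  {synth_vs_song:6.2f} % of the song duration")
print(f"real-time factor of feats_to_audio: {real_time_factor:.2f}")
```

mgc_to_sp alone accounts for roughly 95.7% of feats_to_audio, and the audio regeneration as a whole runs at about 4.4 times the song duration.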

Sadly, to my knowledge, SPTK is the only fast (C) implementation of the Mel-Generalized Cepstrum conversion. And this is not a question of GPU, because it is pure CPU code. What the hell is going on with this algorithm?!

I know my computer is an old-school one: a Dell Workstation T7400 with a 4-core Intel Xeon @ 2.33GHz and 16GB RAM. But it works very well for many things, except for pure Deep Learning stuff.

I don't know whether anything can be done about this in WGANSing in the future, because MGC is at the heart of the project, but I will investigate ways to optimize this process. I'm sure it's possible to reduce the computation time with some tricks.
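One candidate trick is to split the frames into chunks and convert the chunks in parallel on the 4 cores. The sketch below is only an illustration under assumptions: it assumes the conversion of one frame does not depend on the others (not verified here for mgc_to_sp), and `fake_convert_chunk` is a hypothetical stand-in, not the real SPTK call. Threads only help if the underlying conversion releases the GIL or runs in an external process; otherwise `ProcessPoolExecutor` from the same module could be swapped in.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(frames, chunk_size):
    """Cut the list of frames into consecutive chunks of at most chunk_size frames."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def convert_in_chunks(frames, convert_chunk, chunk_size=256, max_workers=4):
    """Apply convert_chunk to each chunk in parallel and reassemble frames in order."""
    chunks = split_into_chunks(frames, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the input chunks.
        converted = list(pool.map(convert_chunk, chunks))
    return [frame for chunk in converted for frame in chunk]


def fake_convert_chunk(chunk):
    """Hypothetical stand-in for the real per-chunk conversion (assumption)."""
    return [[2.0 * value for value in frame] for frame in chunk]


# Ten tiny two-value "frames" as toy input.
frames = [[float(i), float(i + 1)] for i in range(10)]
result = convert_in_chunks(frames, fake_convert_chunk, chunk_size=3, max_workers=4)
print(len(result), "frames converted")
```

Whether this actually reduces the 737 seconds would have to be measured; the point here is only that the frame order is preserved after the parallel step.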

In any case, well done with WGANSing, I love this kind of project!


Hi. To generate singing voice, it expects a .hdf5 file from the dataset, and generating the .hdf5 needs a wave file. Can it not use wave files directly?


I have had the same experience as you: sp_to_mfsc in data preparation and mfsc_to_mgc in inference are time consuming, and some hyper-parameters, like the 0.45 in sp_to_mfsc, also confuse me. Maybe MelGAN could work better than mfsc_to_mgc to convert the spectrum into a signal.