Performance bottleneck (not from model)
When I ran `prep_data_nus.py` for the first time, I noticed the long preprocessing time needed to generate the HDF5 files: roughly 3 hours on my computer for the 96 files. I traced the performance bottleneck to `sp_to_mgc` (the SPTK dependency).
To synthesize a 2m54s song (the Elton John one from the NUS database), my computer needs more than 13 minutes, 10 minutes longer than the song itself. I first assumed it was because I run the model on CPU (no GPU), but I did some measurements and found that the problem is clearly not the model or the 'AI' part.
The inference call:
```python
import models
import config

file_name = 'nus_JLEE_sing_15.hdf5'
singer_name = 'MPOL'
singer_index = config.singers.index(singer_name)

model = models.WGANSing()
model.test_file_hdf5_no_question(file_name, singer_index)
```
`test_file_hdf5_no_question` is identical to `test_file_hdf5`, minus the interactive prompts, plus per-function timing measurements; it only writes the synthesized audio (not the ground truth).
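For reference, the timings below were collected with a simple wall-clock wrapper around each call; a minimal stdlib sketch (the helper name is mine, not from the repo):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the wrapped block in `results`."""
    start = time.time()
    try:
        yield
    finally:
        results[label] = time.time() - start

# Usage, with function names as in the measurements below:
# results = {}
# with timed('feats_to_audio', results):
#     utils.feats_to_audio(...)
```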
The timing results (in seconds):
- load_model [*] : 2.7976150512695
- read_hdf5_file : 0.0341496467590
- process_file [*] : 3.0663671493530
- feats_to_audio : 770.0193181037903
[*] Tensorflow calls
Clearly, the AI part is very fast, even on CPU. The problem comes from the audio regeneration.
Details of the `feats_to_audio` calls (again in seconds):
- f0_to_hertz : 0.0130412578582
- mfsc_to_mgc : 0.7175555229187
- mgc_to_sp : 737.2016060352325
- pw.synthesize : 25.4196729660034
- sf.write : 0.7051553726196
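To put the breakdown in perspective, a quick back-of-the-envelope check (timings rounded from the list above) shows that `mgc_to_sp` alone accounts for almost the entire `feats_to_audio` cost:

```python
# Timings reported above, rounded to milliseconds.
timings = {
    'f0_to_hertz': 0.013,
    'mfsc_to_mgc': 0.718,
    'mgc_to_sp': 737.202,
    'pw.synthesize': 25.420,
    'sf.write': 0.705,
}

total = sum(timings.values())
share = timings['mgc_to_sp'] / total
print(f"mgc_to_sp: {share:.1%} of feats_to_audio")  # prints: mgc_to_sp: 96.5% of feats_to_audio
```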
The PyWorld synthesize call is acceptable at 25 seconds (about 14% of the song duration), but the SPTK call is not.
Sadly, to my knowledge, SPTK is the only fast (C) implementation of the Mel-Generalized Cepstrum conversion. And it is not a GPU question, since this is pure CPU code. What the hell is going on with this algorithm?!
I know my computer is an old-school one: a Dell Workstation T7400 with a 4-core Intel Xeon @ 2.33 GHz and 16 GB RAM. But it works very well for everything except pure deep learning workloads.
I don't know whether this can change in the future, since MGC is at the heart of WGANSing, but I will investigate ways to optimize this step. I'm sure the computation time can be reduced with some tricks.
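One possible trick, a sketch only and not tested against this codebase: the MGC-to-spectrum conversion is independent per frame, so it could be split across CPU cores with `multiprocessing`. Here `frame_to_sp` is a hypothetical stand-in for the real per-frame conversion (e.g. SPTK's mgc2sp routine):

```python
import numpy as np
from multiprocessing import Pool

def frame_to_sp(frame):
    # Placeholder for the real per-frame MGC -> spectral-envelope
    # conversion; squaring just keeps the sketch runnable.
    return frame ** 2

def mgc_to_sp_parallel(mgc, workers=4):
    """Convert an (n_frames, order) MGC matrix frame-by-frame in parallel.

    Assumes the conversion is independent across frames (true for
    mgc2sp); Pool.map preserves the original frame order.
    """
    with Pool(workers) as pool:
        return np.stack(pool.map(frame_to_sp, list(mgc)))

if __name__ == "__main__":
    mgc = np.random.rand(100, 60)
    sp = mgc_to_sp_parallel(mgc)
    print(sp.shape)  # (100, 60)
```

With 4 cores this would bound the speedup at roughly 4x, so it would not fully close the gap, but it is cheap to try before looking at alternative vocoders.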
In any case, well done with WGANSing, I love this kind of project!
Hi, to generate singing voice, the model expects an .hdf5 file from the dataset, and generating the .hdf5 needs a wave file. Can it not use wave files directly?
I have the same experience as you: `sp_to_mfsc` in data preparation and `mfsc_to_mgc` in inference are time-consuming, and some hyper-parameters, like the 0.45 in `sp_to_mfsc`, also confuse me. Maybe MelGAN would work better than the `mfsc_to_mgc` path for converting the spectrum back into a signal.