Reconstructing audio with only the VAE produces noise

SuperiorDtj opened this issue a year ago · comments

Here is my code. Is there something wrong with the way I am using the VAE?

`def recon_vae(self, filename):
    """Reconstruct audio using only the VAE."""
    # Load the file, resample to 16 kHz, and keep the first channel.
    waveform, sample_rate = torchaudio.load(filename)
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)[0]
    # Remove the DC offset and scale the peak amplitude to 0.5.
    waveform = waveform - torch.mean(waveform)
    waveform = waveform / (torch.max(torch.abs(waveform)) + 1e-8)
    waveform = 0.5 * waveform

    with torch.no_grad():
        audio = torch.unsqueeze(waveform, 0)
        audio = torch.nan_to_num(torch.clip(audio, -1, 1))
        melspec, log_magnitudes_stft, energy = self.stft.mel_spectrogram(audio)
        # Swap the last two axes and add a channel dimension.
        melspec = melspec.transpose(1, 2).unsqueeze(1)
        # Encode to the latent space, then decode back to a mel spectrogram and a waveform.
        truth_latent = self.vae.get_first_stage_encoding(self.vae.encode_first_stage(melspec))
        mel_recon = self.vae.decode_first_stage(truth_latent)
        wave = self.vae.decode_to_waveform(mel_recon)
    return wave[0], waveform`

Deepanway · Answer 1 · Fri Jun 02 2023 22:49:15 GMT+0800 (China Standard Time)

Can you try the following:

import torch
import torchaudio
from tango import Tango
from tools.torch_tools import wav_to_fbank

filename = ... 

device = "cuda:0"
tango = Tango("declare-lab/tango", device)
tango.vae.eval()
tango.stft.eval()

duration = 10
target_length = int(duration * 102.4)

with torch.no_grad():
    mel, _, waveform = wav_to_fbank([filename], target_length, tango.stft)
    mel = mel.unsqueeze(1).to(device)
    latent = tango.vae.get_first_stage_encoding(tango.vae.encode_first_stage(mel))
    reconstructed_mel = tango.vae.decode_first_stage(latent)
    reconstructed_waveform = tango.vae.decode_to_waveform(reconstructed_mel)[0]

tianjiao du · Answer 2 · Mon Jun 05 2023 09:50:48 GMT+0800 (China Standard Time)

Thanks for your code! Now I can reconstruct the audio, but only when the number of frames in the audio is a multiple of four (a 3.6 s clip works, a 3.7 s clip does not).
Is this a common issue with the VAE model?

Deepanway · Answer 3 · Tue Jun 06 2023 13:45:05 GMT+0800 (China Standard Time)

What exactly goes wrong when you reconstruct a 3.7 s clip? Is the entire 3.7 s noise, or only the last 0.1 s?

tianjiao du · Answer 4 · Tue Jun 06 2023 14:02:42 GMT+0800 (China Standard Time)

When the VAE reconstructs a 3.7 s clip, it generates noise for the entire 3.7 s.

chave luv · Answer 5 · Sat Jul 29 2023 22:46:06 GMT+0800 (China Standard Time)

I have the same problem. Has it been solved? I tried reconstructing the same audio sample several times, and the result is always noise. The noise also differs greatly from one run to the next.

Is the only solution to set the duration like this?
target_length = int(duration * 102.4)
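
One possible workaround, based on the earlier observation in this thread that reconstruction only worked when the frame count was a multiple of four, is to round `target_length` up to the next multiple of four instead of restricting the clip duration. The sketch below is only an illustration: `padded_target_length` is a hypothetical helper (not part of Tango), the factor 102.4 is taken from the snippet in Answer 1, and it assumes that `wav_to_fbank` pads or trims the mel spectrogram to whatever `target_length` it is given.

```python
import math

# Frames per second, taken from the snippet above: target_length = int(duration * 102.4)
FRAMES_PER_SECOND = 102.4


def padded_target_length(duration, multiple=4):
    """Hypothetical helper: frame count for `duration` seconds,
    rounded up to the next multiple of `multiple`."""
    frames = int(duration * FRAMES_PER_SECOND)
    return math.ceil(frames / multiple) * multiple


# Compare the raw and rounded-up frame counts for the two durations in this thread.
for seconds in (3.6, 3.7):
    print(seconds, int(seconds * FRAMES_PER_SECOND), padded_target_length(seconds))
```

For a 3.6 s clip the raw frame count is 368, which is already divisible by four. For a 3.7 s clip it is 378, which is not, and the helper rounds it up to 380. Passing the rounded value as `target_length` to `wav_to_fbank` may avoid the noise, but this has not been confirmed by the maintainers in this thread.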