declare-lab / tango

A family of diffusion models for text-to-audio generation.

Home Page:https://tango2-web.github.io/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

After just using VAE reconstruct a audio, I only get noise

SuperiorDtj opened this issue · comments

Here is my code. Is there something wrong on my method about using vae?

`def recon_vae(self, filename):
        """ recon audio only by vae """
        with torch.no_grad():
        waveform, sample_rate = torchaudio.load(filename)
        waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)[0]
        waveform = waveform - torch.mean(waveform)
        waveform = waveform / (torch.max(torch.abs(waveform)) + 1e-8)
        waveform = 0.5 * waveform
        waveform = waveform / torch.max(torch.abs(waveform))
        waveform = 0.5 * waveform
      
        #waveform = 0.5 * waveform[0:int(len(waveform)*1)]
        
        audio = torch.unsqueeze(waveform, 0)
        audio = torch.nan_to_num(torch.clip(audio, -1, 1))
        audio = torch.autograd.Variable(audio, requires_grad=False)
        melspec, log_magnitudes_stft, energy = self.stft.mel_spectrogram(audio)
        melspec = melspec.transpose(1, 2)
        melspec = melspec.unsqueeze(1)
        truth_lattent = self.vae.get_first_stage_encoding(self.vae.encode_first_stage(melspec))
       
        mel_recon = self.vae.decode_first_stage(truth_lattent)
        wave = self.vae.decode_to_waveform(mel_recon)
    return wave[0], waveform`

Can you try the folllowing:

import torch
import torchaudio
from tango import Tango
from tools.torch_tools import wav_to_fbank

filename = ... 

device = "cuda:0"
tango = Tango("declare-lab/tango", device)
tango.vae.eval()
tango.stft.eval()

duration = 10
target_length = int(duration * 102.4)

with torch.no_grad():
    mel, _, waveform = wav_to_fbank([filename], target_length, tango.stft)
    mel = mel.unsqueeze(1).to(device)
    latent = tango.vae.get_first_stage_encoding(tango.vae.encode_first_stage(mel))
    reconstructed_mel = tango.vae.decode_first_stage(latent)
    reconstructed_waveform = tango.vae.decode_to_waveform(reconstructed_mel)[0]

Can you try the folllowing:

import torch
import torchaudio
from tango import Tango
from tools.torch_tools import wav_to_fbank

filename = ... 

device = "cuda:0"
tango = Tango("declare-lab/tango", device)
tango.vae.eval()
tango.stft.eval()

duration = 10
target_length = int(duration * 102.4)

with torch.no_grad():
    mel, _, waveform = wav_to_fbank([filename], target_length, tango.stft)
    mel = mel.unsqueeze(1).to(device)
    latent = tango.vae.get_first_stage_encoding(tango.vae.encode_first_stage(mel))
    reconstructed_mel = tango.vae.decode_first_stage(latent)
    reconstructed_waveform = tango.vae.decode_to_waveform(reconstructed_mel)[0]

Thanks for your code!Now I can reconstruct the audio, but only in the situation that the number of the audio's frames is the multiple of four(3.6s dur instead of 3.7s dur)it can reconstruct the audio.
Is this commom issue of the VAE model?

What is the exact issue when reconstructing a 3.7s audio? Does it generate noise for the entire 3.7s or the last 0.1s?

What is the exact issue when reconstructing a 3.7s audio? Does it generate noise for the entire 3.7s or the last 0.1s?

When the VAE reconsturct a 3.7s audio, it generate noise for the entire 3.7s

I meet the same problem as u. Have the problem been solved? I tried making reconstruction on the same one audio smaple for several times, the reconstructed results are always very different noise. And the results of each reconstruction vary greatly from one another.

The only one solution is setting the duration like this?
target_length = int(duration * 102.4)