Noise is being added to generated speech in Python E2E flow (TFLite Models)
barrylee111 opened this issue
Barry Lee commented
Description
I am working on a Unity project in which I modulate voices (e.g. source speech → voice modulator → target speech (elf)). I have an E2E flow running with the TFLite models, but a noticeable amount of noise is added to the generated speech; it sounds almost like clipping. I am using the TFLite models from the repo and have split the quantizer into a QuantizerEncoder and a QuantizerDecoder. I'm not sure whether a better solution would be to convert Lyra into a DLL and run that in Unity instead of using the models, but this is what I have so far.
E2E Flow
- Load a wav file via librosa
- Pad the data to meet data_length % 320 == 0
- Feed the data through the 4 models: Encoder, QuantizerEncoder, QuantizerDecoder, & Decoder
- Store the waveform data as I go, as:
  - One singular array of data
  - A series of audio clips
- Save the singular array of waveform data as a wav file
- Play back the file
Code
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/soundstream_encoder.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/lyragan.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/quantizer_encoder.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/quantizer_decoder.tflite
import tensorflow as tf
import numpy as np
import librosa
def getAudioData(audio_file, verbose=False):
    data, sr = librosa.load(audio_file, sr=None)
    if verbose:
        print(len(data))
    # Zero-pad so the length is a multiple of the 320-sample frame size
    batch_size = 320
    padding_length = -len(data) % batch_size
    padded_data = np.pad(data, (0, padding_length), mode='constant', constant_values=0)
    return padded_data, sr
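One thing I'm flagging as an assumption on my part: I believe the Lyra v2 SoundStream models operate on 16 kHz audio, so 320 samples is one 20 ms frame only at that rate. Since librosa.load(..., sr=None) keeps the file's native sample rate, a file at any other rate would be framed incorrectly. A minimal variant that forces 16 kHz:

def getAudioData16k(audio_file):
    # Assumption: the TFLite models expect 16 kHz input; librosa resamples on load
    data, sr = librosa.load(audio_file, sr=16000)
    padding_length = -len(data) % 320
    return np.pad(data, (0, padding_length), mode='constant'), sr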
# Encoder:
def runEncoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="soundstream_encoder.tflite")
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    # Shape the 320-sample frame as a (1, 320) batch
    input_data = np.array(input_data, dtype=input_details[0]['dtype'])
    input_data = np.reshape(input_data, (1, 320))
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    if verbose:
        print(output_data)
    return output_data
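For debugging shape or dtype mismatches, plain TFLite introspection (nothing Lyra-specific) prints what each model actually expects:

for path in ["soundstream_encoder.tflite", "quantizer_encoder.tflite",
             "quantizer_decoder.tflite", "lyragan.tflite"]:
    interp = tf.lite.Interpreter(model_path=path)
    interp.allocate_tensors()
    print(path)
    for d in interp.get_input_details():
        print("  in :", d['name'], d['shape'], d['dtype'])
    for d in interp.get_output_details():
        print("  out:", d['name'], d['shape'], d['dtype'])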
# Quantizer Encoder:
def runQuantizerEncoderInference(input_data2, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="quantizer_encoder.tflite")
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    # Scalar number of quantizers (46 corresponds to Lyra's 9.2 kbps setting)
    input_data1 = np.array(46, dtype=np.int32)
    interpreter.set_tensor(input_details[0]['index'], input_data1)
    interpreter.set_tensor(input_details[1]['index'], input_data2)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    if verbose:
        print(output_data)
    return output_data
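Note that input_details[0] and input_details[1] bind the scalar and the feature tensor purely by position; if the model's input order ever differs from what I'm assuming here, binding by tensor name is safer. A small sketch (setInputByName is a hypothetical helper; check the name fragments against the introspection output above):

def setInputByName(interpreter, name_fragment, value):
    # Hypothetical helper: match an input tensor by partial name instead of position
    for d in interpreter.get_input_details():
        if name_fragment in d['name']:
            interpreter.set_tensor(d['index'], value)
            return
    raise KeyError(f"no input tensor matching '{name_fragment}'")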
# Quantizer Decoder:
def runQuantizerDecoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="quantizer_decoder.tflite")
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    if verbose:
        print(output_data)
    return output_data
# Decoder (LyraGAN):
def runDecoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="lyragan.tflite")
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    if verbose:
        print(output_data)
    return output_data
audio_file = "<wavfile_path>.wav"
data, sr = getAudioData(audio_file)  # already zero-padded so that len(data) % 320 == 0
batch_size = 320
num_batches = len(data) // batch_size
waveform_data = None
audio_clips = None
for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = (i + 1) * batch_size
    batch_data = data[start_idx:end_idx]
    enc_output = runEncoderInference(batch_data)
    qe_output = runQuantizerEncoderInference(enc_output)
    qd_output = runQuantizerDecoderInference(qe_output)
    dec_output = runDecoderInference(qd_output)
    if i == 0:
        waveform_data = dec_output[0]  # Concatenates all waveform data
        audio_clips = dec_output       # Stores waveform data as clips
    else:
        waveform_data = np.concatenate((waveform_data, dec_output[0]))
        audio_clips = np.concatenate((audio_clips, dec_output))
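One candidate explanation for the clipping-like artifacts (an assumption I haven't verified against the cpp sources): if the streaming models keep internal state in TFLite variable tensors, then constructing a fresh Interpreter inside every run*Inference call would reset that state on every 320-sample frame, producing discontinuities at frame boundaries. A sketch of the same loop with each interpreter built once and reused across frames:

def makeInterpreter(path):
    interp = tf.lite.Interpreter(model_path=path)
    interp.allocate_tensors()
    return interp

def runModel(interp, *inputs):
    # Bind inputs positionally, invoke, and return the first output
    for detail, value in zip(interp.get_input_details(), inputs):
        interp.set_tensor(detail['index'], value)
    interp.invoke()
    return interp.get_tensor(interp.get_output_details()[0]['index'])

encoder = makeInterpreter("soundstream_encoder.tflite")
q_enc = makeInterpreter("quantizer_encoder.tflite")
q_dec = makeInterpreter("quantizer_decoder.tflite")
decoder = makeInterpreter("lyragan.tflite")

num_quantizers = np.array(46, dtype=np.int32)  # same scalar as above
frames = []
for i in range(num_batches):
    frame = data[i * batch_size:(i + 1) * batch_size]
    frame = np.reshape(frame.astype(np.float32), (1, 320))  # assuming float32 input
    enc = runModel(encoder, frame)
    codes = runModel(q_enc, num_quantizers, enc)
    feats = runModel(q_dec, codes)
    frames.append(runModel(decoder, feats)[0])
waveform_data = np.concatenate(frames)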
import torch
import torchaudio

audio_tensor = torch.from_numpy(np.asarray(waveform_data)).unsqueeze(0)  # shape (1, num_samples)
output_file = "<your_output_path>.wav"
torchaudio.save(output_file, audio_tensor, sr)
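As a follow-up check, since the artifact sounds like clipping: if the decoder output leaves the [-1, 1] range, samples will clip on playback or when converted to 16-bit PCM, so it may be worth verifying the peak level before the save step above:

peak = float(np.abs(waveform_data).max())
print("peak amplitude:", peak)
if peak > 1.0:
    waveform_data = waveform_data / peak  # normalize rather than hard-clip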
Questions
- Is the better solution to create a DLL and run that in Unity?
- Do the models encompass all of the pre- & post-processing needed to produce the clean output signal that the cpp implementation provides (e.g. the Integration Test Example)?
- Have I made an error in my implementation? I haven't been able to find a Python implementation yet that runs the data and tests, so this is what I've come up with so far.
- Is the noise possibly due to the fact that I am concatenating all of the data rather than playing each clip back iteratively? I attempted to play the output back iteratively as plain waveform data in a Jupyter Notebook instead of from stored wav files, but no sound was produced.