Noise is being added to generated speech in Python E2E flow (TFLite Models)
barrylee111 opened this issue
Barry Lee commented
Description
I am working on a Unity project in which I modulate voices (e.g. source speech → voice modulator → target speech (elf)). I have an E2E flow running with the TFLite models, but a noticeable amount of noise is added to the generated speech; it sounds almost like clipping. I am using the TFLite models from the repo and have split the quantizer into a QuantizerEncoder and a QuantizerDecoder. I'm not sure whether a better solution would be to convert Lyra into a DLL and run that in Unity instead of using the models, but this is what I have so far.
E2E Flow
- Load a wav file via librosa
- Pad the data to meet data_length % 320 == 0
- Feed the data through the 4 models: Encoder, QuantizerEncoder, QuantizerDecoder, & Decoder
- Store the waveform data as I go, as:
  - One singular array of data
  - A series of audio clips
- Save the singular array of waveform data as a wav file
- Play back the file
Code
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/soundstream_encoder.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/lyragan.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/quantizer_encoder.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/quantizer_decoder.tflite
import tensorflow as tf
import numpy as np
import librosa
def getAudioData(audio_file, verbose=False):
    data, sr = librosa.load(audio_file, sr=None)
    if verbose:
        print(len(data))
    # Zero-pad so the length is a multiple of the 320-sample frame size
    batch_size = 320
    padding_length = -len(data) % batch_size
    padded_data = np.pad(data, (0, padding_length), mode='constant', constant_values=0)
    return padded_data, sr
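One thing I'm flagging as an assumption on my part: I believe the Lyra v2 SoundStream models operate on 16 kHz audio, so 320 samples is one 20 ms frame only at that rate. Since librosa.load(..., sr=None) keeps the file's native sample rate, a file at any other rate would be framed incorrectly. A minimal variant that forces 16 kHz:

def getAudioData16k(audio_file):
    # Assumption: the TFLite models expect 16 kHz input; librosa resamples on load
    data, sr = librosa.load(audio_file, sr=16000)
    padding_length = -len(data) % 320
    return np.pad(data, (0, padding_length), mode='constant'), sr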
# Encoder:
def runEncoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="soundstream_encoder.tflite")
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    # Shape the 320-sample frame as a (1, 320) batch
    input_data = np.array(input_data, dtype=input_details[0]['dtype'])
    input_data = np.reshape(input_data, (1, 320))
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    if verbose:
        print(output_data)
    return output_data
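For debugging shape or dtype mismatches, plain TFLite introspection (nothing Lyra-specific) prints what each model actually expects:

for path in ["soundstream_encoder.tflite", "quantizer_encoder.tflite",
             "quantizer_decoder.tflite", "lyragan.tflite"]:
    interp = tf.lite.Interpreter(model_path=path)
    interp.allocate_tensors()
    print(path)
    for d in interp.get_input_details():
        print("  in :", d['name'], d['shape'], d['dtype'])
    for d in interp.get_output_details():
        print("  out:", d['name'], d['shape'], d['dtype'])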
# Quantizer Encoder:
def runQuantizerEncoderInference(input_data2, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="quantizer_encoder.tflite")
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    # Scalar number of quantizers (46 corresponds to Lyra's 9.2 kbps setting)
    input_data1 = np.array(46, dtype=np.int32)
    interpreter.set_tensor(input_details[0]['index'], input_data1)
    interpreter.set_tensor(input_details[1]['index'], input_data2)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    if verbose:
        print(output_data)
    return output_data
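Note that input_details[0] and input_details[1] bind the scalar and the feature tensor purely by position; if the model's input order ever differs from what I'm assuming here, binding by tensor name is safer. A small sketch (setInputByName is a hypothetical helper; check the name fragments against the introspection output above):

def setInputByName(interpreter, name_fragment, value):
    # Hypothetical helper: match an input tensor by partial name instead of position
    for d in interpreter.get_input_details():
        if name_fragment in d['name']:
            interpreter.set_tensor(d['index'], value)
            return
    raise KeyError(f"no input tensor matching '{name_fragment}'")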
# Quantizer Decoder:
def runQuantizerDecoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="quantizer_decoder.tflite")
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    if verbose:
        print(output_data)
    return output_data
# Decoder (LyraGAN):
def runDecoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="lyragan.tflite")
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    if verbose:
        print(output_data)
    return output_data
audio_file = "<wavfile_path>.wav"
data, sr = getAudioData(audio_file)  # already zero-padded so that len(data) % 320 == 0
batch_size = 320
num_batches = len(data) // batch_size
waveform_data = None
audio_clips = None
for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = (i + 1) * batch_size
    batch_data = data[start_idx:end_idx]
    enc_output = runEncoderInference(batch_data)
    qe_output = runQuantizerEncoderInference(enc_output)
    qd_output = runQuantizerDecoderInference(qe_output)
    dec_output = runDecoderInference(qd_output)
    if i == 0:
        waveform_data = dec_output[0]  # Concatenates all waveform data
        audio_clips = dec_output       # Stores waveform data as clips
    else:
        waveform_data = np.concatenate((waveform_data, dec_output[0]))
        audio_clips = np.concatenate((audio_clips, dec_output))
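One candidate explanation for the clipping-like artifacts (an assumption I haven't verified against the cpp sources): if the streaming models keep internal state in TFLite variable tensors, then constructing a fresh Interpreter inside every run*Inference call would reset that state on every 320-sample frame, producing discontinuities at frame boundaries. A sketch of the same loop with each interpreter built once and reused across frames:

def makeInterpreter(path):
    interp = tf.lite.Interpreter(model_path=path)
    interp.allocate_tensors()
    return interp

def runModel(interp, *inputs):
    # Bind inputs positionally, invoke, and return the first output
    for detail, value in zip(interp.get_input_details(), inputs):
        interp.set_tensor(detail['index'], value)
    interp.invoke()
    return interp.get_tensor(interp.get_output_details()[0]['index'])

encoder = makeInterpreter("soundstream_encoder.tflite")
q_enc = makeInterpreter("quantizer_encoder.tflite")
q_dec = makeInterpreter("quantizer_decoder.tflite")
decoder = makeInterpreter("lyragan.tflite")

num_quantizers = np.array(46, dtype=np.int32)  # same scalar as above
frames = []
for i in range(num_batches):
    frame = data[i * batch_size:(i + 1) * batch_size]
    frame = np.reshape(frame.astype(np.float32), (1, 320))  # assuming float32 input
    enc = runModel(encoder, frame)
    codes = runModel(q_enc, num_quantizers, enc)
    feats = runModel(q_dec, codes)
    frames.append(runModel(decoder, feats)[0])
waveform_data = np.concatenate(frames)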
import torch
import torchaudio

audio_tensor = torch.from_numpy(np.asarray(waveform_data)).unsqueeze(0)  # shape (1, num_samples)
output_file = "<your_output_path>.wav"
torchaudio.save(output_file, audio_tensor, sr)
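As a follow-up check, since the artifact sounds like clipping: if the decoder output leaves the [-1, 1] range, samples will clip on playback or when converted to 16-bit PCM, so it may be worth verifying the peak level before the save step above:

peak = float(np.abs(waveform_data).max())
print("peak amplitude:", peak)
if peak > 1.0:
    waveform_data = waveform_data / peak  # normalize rather than hard-clip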
Questions
- Is the better solution to create a DLL and run that in Unity?
- Do the models encompass all of the pre- & post-processing needed to produce the clean output signal that the cpp implementation provides (e.g. the Integration Test Example)?
- Have I made an error in my implementation? I haven't been able to find a Python implementation yet that runs the data and tests, so this is what I've come up with so far.
- Is the noise possibly due to the fact that I am concatenating all of the data rather than playing each clip back iteratively? I attempted to play the output back iteratively as plain waveform data in a Jupyter Notebook instead of from stored wav files, but no sound was produced.