google / lyra

A Very Low-Bitrate Codec for Speech Compression

Noise is being added to generated speech in Python E2E flow (TFLite Models)

barrylee111 opened this issue

Description

I am currently working on a project built in Unity where I am modulating voices (e.g. source speech → voice modulator → target speech (elf)). I currently have an E2E flow running with the TFLite models, but a noticeable amount of noise is added during speech generation; it sounds almost like clipping. I'm using the TFLite models from the repo, and I have split the quantizer into a QuantizerEncoder and a QuantizerDecoder. I'm not sure whether a better solution would be to convert Lyra into a DLL and run that in Unity instead of the models, but this is what I have so far.

E2E Flow

  • Load a wav file via librosa
  • Pad the data so that data_length % 320 == 0
  • Feed the data through the 4 models: Encoder, QuantizerEncoder, QuantizerDecoder, & Decoder
  • Store the waveform data as I go as:
    • One singular array of data
    • A series of audio clips
  • Save the singular array of waveform data as a wav file
  • Play back the file

Code

!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/soundstream_encoder.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/lyragan.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/quantizer_encoder.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/quantizer_decoder.tflite
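
For reference, here is a minimal sketch (not part of the flow itself) that dumps each model's expected input/output shapes and dtypes, assuming the four files above sit in the working directory:

import tensorflow as tf

model_files = [
    "soundstream_encoder.tflite",
    "quantizer_encoder.tflite",
    "quantizer_decoder.tflite",
    "lyragan.tflite",
]

for path in model_files:
    interpreter = tf.lite.Interpreter(model_path=path)
    interpreter.allocate_tensors()
    print(path)
    for d in interpreter.get_input_details():
        print("  input :", d['name'], d['shape'], d['dtype'])
    for d in interpreter.get_output_details():
        print("  output:", d['name'], d['shape'], d['dtype'])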

import tensorflow as tf
import numpy as np

import librosa

def getAudioData(audio_file, verbose=False):
    # sr=None keeps the file's native sample rate
    data, sr = librosa.load(audio_file, sr=None)

    if verbose:
        print(len(data))

    # Zero-pad so the total length is a multiple of the 320-sample frame size
    batch_size = 320
    padding_length = (batch_size - len(data) % batch_size) % batch_size
    padded_data = np.pad(data, (0, padding_length), mode='constant', constant_values=0)

    return padded_data, sr
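
One caveat worth flagging: librosa.load(..., sr=None) keeps the file's native sample rate, while these Lyra V2 models appear to expect 16 kHz audio (320 samples per 20 ms frame). If the source wav is not 16 kHz, something like the following sketch would need to go inside getAudioData right after librosa.load (and before the padding, so the % 320 alignment still holds); the current code does not do this.

    target_sr = 16000  # assumed: 320 samples per 20 ms frame implies 16 kHz
    if sr != target_sr:
        data = librosa.resample(data, orig_sr=sr, target_sr=target_sr)
        sr = target_sr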

# Encoder:
def runEncoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="encoder.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Reshape the 320-sample frame to the encoder's expected (1, 320) input
    input_data = np.array(input_data, dtype=input_details[0]['dtype'])
    input_data = np.reshape(input_data, (1, 320))

    interpreter.set_tensor(input_details[0]['index'], input_data)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    
    if verbose:
        print(output_data)
    
    return output_data

# Quantizer Encoder:
def runQuantizerEncoderInference(input_data2, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="quantizer_encoder.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Scalar input: presumably the number of quantizers to apply
    # (46 appears to correspond to Lyra V2's highest bitrate, 9.2 kbps)
    input_data1 = np.array(46, dtype=np.int32)
    interpreter.set_tensor(input_details[0]['index'], input_data1)

    # input_data2 = np.ones(input_details[1]['shape'], dtype=input_details[1]['dtype'])
    interpreter.set_tensor(input_details[1]['index'], input_data2)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    
    if verbose:
        print(output_data)
    
    return output_data

# Quantizer Decoder:
def runQuantizerDecoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="quantizer_decoder.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # input_data = np.ones(input_details[0]['shape'], dtype=input_details[0]['dtype'])
    interpreter.set_tensor(input_details[0]['index'], input_data)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    
    if verbose:
        print(output_data)
    
    return output_data

# Decoder:
def runDecoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="lyragan.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # input_data = np.ones(input_details[0]['shape'], dtype=input_details[0]['dtype'])
    interpreter.set_tensor(input_details[0]['index'], input_data)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    
    if verbose:
        print(output_data)
        
    return output_data

audio_file = "<wavfile_path>.wav"
data, sr = getAudioData(audio_file)

# data has already been zero-padded in getAudioData so that len(data) % 320 == 0

batch_size = 320
num_batches = len(data) // batch_size
waveform_data = None
audio_clips = None

for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = (i + 1) * batch_size
    batch_data = data[start_idx:end_idx]

    enc_output = runEncoderInference(batch_data)
    qe_output = runQuantizerEncoderInference(enc_output)
    qd_output = runQuantizerDecoderInference(qe_output)
    dec_output = runDecoderInference(qd_output)
    
    if i == 0:
        waveform_data = dec_output[0]  # Running buffer of concatenated waveform data
        audio_clips = dec_output       # Per-frame clips
    else:
        waveform_data = np.concatenate((waveform_data, dec_output[0]))
        audio_clips = np.concatenate((audio_clips, dec_output))

import torchaudio
import torch

audio_tensor = torch.tensor(waveform_data).unsqueeze(0)  # torchaudio.save expects (channels, frames)
output_file = "<your_output_path>.wav"
torchaudio.save(output_file, audio_tensor, sr)
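
A structural note on the loop above: every helper re-creates and re-allocates its tf.lite.Interpreter for each 320-sample frame. A variant that builds the four interpreters once and reuses them across frames is sketched below (makeInterpreter and runModel are just illustrative names, not anything from the repo). I haven't verified whether these models carry internal state between invocations, so I can't say whether this changes the audio, but it does avoid the per-frame setup cost.

# Sketch: build each interpreter once and reuse it for every frame
def makeInterpreter(model_path):
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    return interpreter

encoder = makeInterpreter("soundstream_encoder.tflite")
quant_enc = makeInterpreter("quantizer_encoder.tflite")
quant_dec = makeInterpreter("quantizer_decoder.tflite")
decoder = makeInterpreter("lyragan.tflite")

def runModel(interpreter, inputs):
    # inputs: one numpy array per input tensor, in input_details order
    for detail, value in zip(interpreter.get_input_details(), inputs):
        interpreter.set_tensor(detail['index'], value)
    interpreter.invoke()
    return interpreter.get_tensor(interpreter.get_output_details()[0]['index'])

The per-frame loop would then call runModel(encoder, [...]), runModel(quant_enc, [...]), and so on instead of constructing new interpreters each iteration.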

Questions

  • Would a better solution be to build Lyra as a DLL and run that in Unity?
  • Do the models encompass all of the pre- and post-processing needed to produce the clean output signal that the C++ implementation provides (e.g. the Integration Test Example)?
  • Have I made an error in my implementation? I haven't been able to find a Python implementation yet that runs the data and tests, so this is what I've come up with so far.
  • Is the noise possibly due to the fact that I am concatenating all of the data to test, rather than playing each clip back iteratively? I attempted to play the output back iteratively as plain waveform data in a Jupyter notebook instead of saved wav files, but no sound was produced (a minimal notebook playback sketch follows this list).
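
On the last question, for reference, a minimal way to audition the output directly in a Jupyter notebook (assuming waveform_data, audio_clips, and sr from the script above; IPython.display.Audio needs an explicit rate when given raw numpy data):

from IPython.display import Audio, display

# Play the full concatenated waveform in the notebook
display(Audio(np.squeeze(waveform_data), rate=sr))

# Or audition the first few per-frame clips one at a time
for clip in audio_clips[:5]:
    display(Audio(np.squeeze(clip), rate=sr))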

Resources

Sound samples.zip