google / lyra

A Very Low-Bitrate Codec for Speech Compression

Correct tflite model usage/pipeline?

josephrocca opened this issue · comments

My current guess/understanding is (see the rough code sketch after this list):

  1. Take 320 samples (20 ms) of a 16 kHz audio file.
  2. Feed that into soundstream_encoder.tflite as float32[1,320] and get float32[1,1,64] data as output.
  3. (???) Somehow quantize, transmit over the network, and then dequantize using quantizer.tflite? It has an input shape of int32[46,1,1], so I'm not sure what to do with that.
  4. Feed the output of the dequantization process (float32[1,1,64]) to lyragan.tflite to produce the reconstructed samples (float32[1,320]).
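
Expressed as a rough, untested Python sketch (the signature and tensor names are my guesses from inspecting the models; step 3 is the part I'm unsure about):

import numpy as np
import tensorflow as tf

# Step 1: at 16 kHz, 320 samples is one 20 ms frame.
frame = np.zeros((1, 320), dtype=np.float32)

# Step 2: samples -> features, float32[1,1,64].
encoder = tf.lite.Interpreter('soundstream_encoder.tflite').get_signature_runner()
features = encoder(input_audio=frame)['bottleneck_1/simpleconv']

# Step 3 (???): quantize with quantizer.tflite, send the bits over the
# network, and dequantize back to float32[1,1,64] on the other side.

# Step 4: dequantized features -> reconstructed samples, float32[1,320],
# via lyragan.tflite.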

I'm not able to test this guess right now because I'm blocked on some other issues, but I'm aiming to get a minimal open-source web demo working with either tfjs-tflite or ONNX Runtime Web, so if anyone could correct my understanding here, that would be great!

Hi there, that sounds interesting. Your understanding is basically correct.

I would recommend you look at our example in encoder_main.cc for steps 1, 2, and 3. It produces a .lyra file which contains the quantized data that would be transmitted over the network. decoder_main.cc does the dequantize/reconstruction step.

These examples run on files (faster than real-time). Some work would be needed to use this in a real-time streaming application.

I am excited about the web demo! Please keep us posted about your progress.
I would agree that your understanding is mostly correct, particularly about the soundstream_encoder (samples to features) and lyragan (features to samples) models. I would add that the quantizer is a single model with 2 signatures: encode (features to bits) and decode (bits to features).
The 46 you are seeing in the shape of the quantizer interface comes from the maximum number of quantizers, each contributing 4 bits. That's a total of 46 × 4 = 184 bits every 20 ms, which works out to 9.2 kbps. For 3.2 kbps and 6 kbps, only 16 and 30 quantizers are used, respectively.
You can see how the 2 signatures are loaded and called in the Residual Vector Quantizer class.
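
To make the arithmetic concrete, here is a small Python sketch (the filename matches the quantizer.tflite discussed above; the rest is just the math from this comment):

import tensorflow as tf

# Both signatures live in the same quantizer.tflite file.
quantizer = tf.lite.Interpreter('quantizer.tflite')
encode = quantizer.get_signature_runner('encode')  # features -> bits
decode = quantizer.get_signature_runner('decode')  # bits -> features

# Each quantizer contributes 4 bits per 20 ms frame.
for num_quantizers in (16, 30, 46):
    bits_per_frame = num_quantizers * 4
    kbps = bits_per_frame / 0.020 / 1000
    print(num_quantizers, bits_per_frame, kbps)
# -> 16: 64 bits, 3.2 kbps; 30: 120 bits, 6.0 kbps; 46: 184 bits, 9.2 kbps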

Ahh, okay, I see now in netron.app that the .tflite file has two subgraphs; I'd missed that! 😅 Thanks also for explaining the 46/30/16 stuff @aluebs.

I'll post the demo here or in a new issue when I've got it working.

(Also wanted to say thanks to both of you and all involved in working on this and making it open source! LyraV2's performance is incredible, and it's so cool that it's freely available for all to use 🙏)

Our pleasure. Codecs are about enabling communications. And that is best achieved through interoperability and collaboration.
And it is always great to see cool applications built on top, so I am personally excited about that web demo :)

Hello, thanks for making the tflite files available. I'm able to run soundstream_encoder with a 320-dim input array and get a 64-dim feature, but I don't know how to run the quantizer. Here's roughly what I did:

import numpy as np
import tensorflow as tf

x = np.ones(320, dtype=np.float32)  # dummy 320-sample frame
encoder = tf.lite.Interpreter('soundstream_encoder.tflite').get_signature_runner()
encoder_out = encoder(input_audio=x)['bottleneck_1/simpleconv']  # float32[1,1,64] features
encoder_quantizer = tf.lite.Interpreter('quantizer.tflite').get_signature_runner('encode')
encoder_quantizer(input_frames=???, num_quantizers=8)

But how should I process encoder_out with it? Feeding encoder_out to input_frames doesn't seem to work.

I have not tried running these TFLite files from Python, but passing encoder_out as input_frames seems reasonable to me. How does it fail? Are there any error messages? Beware that the only officially supported values for num_quantizers are 16, 30, and 46.

Thanks for replying. I figured it out: num_quantizers should be an array instead of a single number.
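
For reference, the shape of the fix (the same call appears in the snippet below):

# num_quantizers must be a 1-element array, not a plain Python int.
encoder_quantizer(input_frames=encoder_out, num_quantizers=np.array([46]))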

The input and output of the quantizer don't seem to match.

import numpy as np
import tensorflow as tf

ENCODER_PATH = 'soundstream_encoder.tflite'
QUANTIZER_PATH = 'quantizer.tflite'

x = np.ones(320, dtype=np.float32)
encoder = tf.lite.Interpreter(ENCODER_PATH).get_signature_runner()
encoder_quantizer = tf.lite.Interpreter(QUANTIZER_PATH).get_signature_runner('encode')
decoder_quantizer = tf.lite.Interpreter(QUANTIZER_PATH).get_signature_runner('decode')
encoder_out = encoder(input_audio=x)['bottleneck_1/simpleconv']  # features before quantization
encoder_quantizer_out = encoder_quantizer(input_frames=encoder_out, num_quantizers=np.array([46]))['output_0']  # quantizer indices
decoder_quantizer_out = decoder_quantizer(encoding_indices=encoder_quantizer_out)['output_0']  # features after dequantization
print(encoder_out)
print(decoder_quantizer_out)

Output:
[[[ 4.6768 5.2019 30.9693 29.1799 -13.9479 -20.2321 -18.9173
-2.0216 71.4695 1.2933 -41.781 -9.7017 14.1581 41.5235
14.7401 3.9903 15.8061 26.8866 23.3161 -9.2959 -24.1347
0.1228 7.6399 -2.2619 -16.2647 2.8482 -12.5169 7.9989
12.2995 -39.2526 18.2285 -16.1468 2.0148 11.51 19.276
0.6773 -4.8266 -8.45 -5.6571 26.0052 -2.7498 -28.4975
-32.0878 0.2983 37.0367 -28.8171 -6.4624 13.8729 6.2805
7.6458 -5.3856 -12.0877 1.0219 -6.0388 -1.9798 -2.223
2.0584 -9.4123 1.2139 -9.2483 5.4903 -7.4411 5.5858
5.0048]]]
[[[ 1.5479 -0.9323 13.378 12.6163 -6.5922 -8.9712 -5.1477
-6.1894 33.8916 3.1746 -10.834 -5.5156 1.8543 17.1196
5.9531 4.998 -1.3186 14.8551 11.67 -5.3186 -7.4919
-8.5604 -0.9266 1.9944 -12.1968 1.5891 -7.8631 9.3623
3.3648 -16.7637 15.3934 -5.5406 -0.8632 8.0575 9.5061
-2.7075 -8.618 -6.7226 -1.975 15.2105 2.6138 -14.8284
-7.8372 -1.3559 18.3065 -14.2567 -6.4394 5.7315 -1.872
-0.3169 -9.9577 -0.3303 5.0577 -7.3855 -0.0259 -4.6367
3.3548 1.8679 -0.8338 -7.285 -0.2951 -3.9358 -0.6508
-2.8861]]]

The two arrays look quite different. Am I missing something?

Significant differences between features before and after the quantizer are expected, since the quantizer is learned end-to-end with the whole system, and certain loss functions don't necessarily reward a transparent quantizer but rather a system that generates plausible audio at the decoder.
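
If you want to verify that, run the dequantized features through lyragan.tflite and listen to the result. A rough sketch of that last step, using the raw TFLite interpreter API so no signature names have to be assumed (and assuming the model's input shape matches your float32[1,1,64] features):

import numpy as np
import tensorflow as tf

lyragan = tf.lite.Interpreter('lyragan.tflite')
lyragan.allocate_tensors()
inp = lyragan.get_input_details()[0]
out = lyragan.get_output_details()[0]

# Features -> reconstructed samples, expected float32[1,320].
lyragan.set_tensor(inp['index'], decoder_quantizer_out.astype(np.float32))
lyragan.invoke()
reconstructed = lyragan.get_tensor(out['index'])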

Thank you! I've now got everything working.

Glad to hear that.