google / lyra

A Very Low-Bitrate Codec for Speech Compression

Correct tflite model usage/pipeline?

josephrocca opened this issue · comments

My current guess/understanding is (see the rough code sketch after this list):

  1. Take 320 samples (20 ms) of a 16 kHz audio file.
  2. Feed that into soundstream_encoder.tflite as float32[1,320] and get float32[1,1,64] data as output.
  3. (???) Somehow quantize, transmit over the network, and then dequantize using quantizer.tflite? It has an input shape of int32[46,1,1], so I'm not sure what to do with that.
  4. Feed the output of the dequantization process (float32[1,1,64]) to lyragan.tflite to produce the reconstructed samples (float32[1,320]).
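
Expressed as a rough, untested Python sketch (the signature and tensor names are my guesses from inspecting the models; step 3 is the part I'm unsure about):

import numpy as np
import tensorflow as tf

# Step 1: at 16 kHz, 320 samples is one 20 ms frame.
frame = np.zeros((1, 320), dtype=np.float32)

# Step 2: samples -> features, float32[1,1,64].
encoder = tf.lite.Interpreter('soundstream_encoder.tflite').get_signature_runner()
features = encoder(input_audio=frame)['bottleneck_1/simpleconv']

# Step 3 (???): quantize with quantizer.tflite, send the bits over the
# network, and dequantize back to float32[1,1,64] on the other side.

# Step 4: dequantized features -> reconstructed samples, float32[1,320],
# via lyragan.tflite.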

I'm not able to test this guess right now because I'm blocked on some other issues, but I'm aiming to get a minimal open-source web demo working with either tfjs-tflite or ONNX Runtime Web, so if anyone could correct my understanding here, that would be great!

Hi there, that sounds interesting. Your understanding is basically correct.

I would recommend you look at our example in encoder_main.cc for steps 1, 2, and 3. It produces a .lyra file which contains the quantized data that would be transmitted over the network. decoder_main.cc does the dequantize/reconstruction step.

These examples run on files (faster than real-time). Some work would be needed to use this in a real-time streaming application.

I am excited about the web demo! Please keep us posted about your progress.
I would agree that your understanding is mostly correct, particularly about the soundstream_encoder (samples to features) and lyragan (features to samples) models. I would add that the quantizer is a single model with 2 signatures: encode (features to bits) and decode (bits to features).
The 46 you are seeing in the shape of the quantizer interface comes from the maximum number of quantizers, each contributing 4 bits. That's a total of 46 × 4 = 184 bits every 20 ms, which works out to 9.2 kbps. For 3.2 kbps and 6 kbps, only 16 and 30 quantizers are used, respectively.
You can see how the 2 signatures are loaded and called in the Residual Vector Quantizer class.
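
To make the arithmetic concrete, here is a small Python sketch (the filename matches the quantizer.tflite discussed above; the rest is just the math from this comment):

import tensorflow as tf

# Both signatures live in the same quantizer.tflite file.
quantizer = tf.lite.Interpreter('quantizer.tflite')
encode = quantizer.get_signature_runner('encode')  # features -> bits
decode = quantizer.get_signature_runner('decode')  # bits -> features

# Each quantizer contributes 4 bits per 20 ms frame.
for num_quantizers in (16, 30, 46):
    bits_per_frame = num_quantizers * 4
    kbps = bits_per_frame / 0.020 / 1000
    print(num_quantizers, bits_per_frame, kbps)
# -> 16: 64 bits, 3.2 kbps; 30: 120 bits, 6.0 kbps; 46: 184 bits, 9.2 kbps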

Ahh, okay, I see now in netron.app that the .tflite file has two subgraphs; I'd missed that! 😅 Thanks also for explaining the 46/30/16 stuff @aluebs.

I'll post the demo here or in a new issue when I've got it working.

(Also wanted to say thanks to both of you and all involved in working on this and making it open source! LyraV2's performance is incredible, and it's so cool that it's freely available for all to use 🙏)

Our pleasure. Codecs are about enabling communications. And that is best achieved through interoperability and collaboration.
And it is always great to see cool applications built on top, so I am personally excited about that web demo :)

Hello, thanks for making the tflite files available. I'm able to run soundstream_encoder with a 320-dim input array and get a 64-dim feature, but I don't know how to run the quantizer. Here's roughly what I did:

import numpy as np
import tensorflow as tf

x = np.ones(320, dtype=np.float32)  # dummy 320-sample frame
encoder = tf.lite.Interpreter('soundstream_encoder.tflite').get_signature_runner()
encoder_out = encoder(input_audio=x)['bottleneck_1/simpleconv']  # float32[1,1,64] features
encoder_quantizer = tf.lite.Interpreter('quantizer.tflite').get_signature_runner('encode')
encoder_quantizer(input_frames=???, num_quantizers=8)

But how should I process encoder_out with it? Feeding encoder_out to input_frames doesn't seem to work.

I have not tried running these TFLite files from Python, but passing encoder_out as input_frames seems reasonable to me. How does it fail? Are there any error messages? Beware that the only officially supported values for num_quantizers are 16, 30, and 46.

Thanks for replying. I figured it out: num_quantizers should be an array instead of a single number.
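
For reference, the shape of the fix (the same call appears in the snippet below):

# num_quantizers must be a 1-element array, not a plain Python int.
encoder_quantizer(input_frames=encoder_out, num_quantizers=np.array([46]))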

The input and output of the quantizer don't seem to match.

import numpy as np
import tensorflow as tf

ENCODER_PATH = 'soundstream_encoder.tflite'
QUANTIZER_PATH = 'quantizer.tflite'

x = np.ones(320, dtype=np.float32)
encoder = tf.lite.Interpreter(ENCODER_PATH).get_signature_runner()
encoder_quantizer = tf.lite.Interpreter(QUANTIZER_PATH).get_signature_runner('encode')
decoder_quantizer = tf.lite.Interpreter(QUANTIZER_PATH).get_signature_runner('decode')
encoder_out = encoder(input_audio=x)['bottleneck_1/simpleconv']  # features before quantization
encoder_quantizer_out = encoder_quantizer(input_frames=encoder_out, num_quantizers=np.array([46]))['output_0']  # quantizer indices
decoder_quantizer_out = decoder_quantizer(encoding_indices=encoder_quantizer_out)['output_0']  # features after dequantization
print(encoder_out)
print(decoder_quantizer_out)

Output:
[[[ 4.6768 5.2019 30.9693 29.1799 -13.9479 -20.2321 -18.9173
-2.0216 71.4695 1.2933 -41.781 -9.7017 14.1581 41.5235
14.7401 3.9903 15.8061 26.8866 23.3161 -9.2959 -24.1347
0.1228 7.6399 -2.2619 -16.2647 2.8482 -12.5169 7.9989
12.2995 -39.2526 18.2285 -16.1468 2.0148 11.51 19.276
0.6773 -4.8266 -8.45 -5.6571 26.0052 -2.7498 -28.4975
-32.0878 0.2983 37.0367 -28.8171 -6.4624 13.8729 6.2805
7.6458 -5.3856 -12.0877 1.0219 -6.0388 -1.9798 -2.223
2.0584 -9.4123 1.2139 -9.2483 5.4903 -7.4411 5.5858
5.0048]]]
[[[ 1.5479 -0.9323 13.378 12.6163 -6.5922 -8.9712 -5.1477
-6.1894 33.8916 3.1746 -10.834 -5.5156 1.8543 17.1196
5.9531 4.998 -1.3186 14.8551 11.67 -5.3186 -7.4919
-8.5604 -0.9266 1.9944 -12.1968 1.5891 -7.8631 9.3623
3.3648 -16.7637 15.3934 -5.5406 -0.8632 8.0575 9.5061
-2.7075 -8.618 -6.7226 -1.975 15.2105 2.6138 -14.8284
-7.8372 -1.3559 18.3065 -14.2567 -6.4394 5.7315 -1.872
-0.3169 -9.9577 -0.3303 5.0577 -7.3855 -0.0259 -4.6367
3.3548 1.8679 -0.8338 -7.285 -0.2951 -3.9358 -0.6508
-2.8861]]]

The two arrays look quite different. Am I missing something?

Significant differences between features before and after the quantizer are expected, since the quantizer is learned end-to-end with the whole system, and certain loss functions don't necessarily reward a transparent quantizer but rather a system that generates plausible audio at the decoder.
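
If you want to verify that, run the dequantized features through lyragan.tflite and listen to the result. A rough sketch of that last step, using the raw TFLite interpreter API so no signature names have to be assumed (and assuming the model's input shape matches your float32[1,1,64] features):

import numpy as np
import tensorflow as tf

lyragan = tf.lite.Interpreter('lyragan.tflite')
lyragan.allocate_tensors()
inp = lyragan.get_input_details()[0]
out = lyragan.get_output_details()[0]

# Features -> reconstructed samples, expected float32[1,320].
lyragan.set_tensor(inp['index'], decoder_quantizer_out.astype(np.float32))
lyragan.invoke()
reconstructed = lyragan.get_tensor(out['index'])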

Thank you! I've now got everything working.

Glad to hear that.