google / lyra

A Very Low-Bitrate Codec for Speech Compression


v1.3.2 is much slower than v1.3.1 if it's built into WebAssembly

sile opened this issue

[NOTE] This is just an FYI issue as I know this project doesn't officially support WebAssembly.

As I mentioned in #49, shiguredo/lyra-wasm maintains a no-patch WebAssembly build of Lyra.
Today, I updated the Lyra version to 1.3.2 (shiguredo/lyra-wasm#10). However, it turned out that encoding and decoding performance degraded after the update.

The following table shows benchmark results from https://shiguredo.github.io/lyra-wasm/lyra-benchmark.html
(elapsed times taken to encode/decode 10 seconds of audio data).

| Browser | Lyra Version | Encode Time | Decode Time |
| --- | --- | --- | --- |
| Chrome (M1 Mac) | 1.3.1 | 550.230 ms | 804.230 ms |
| Chrome (M1 Mac) | 1.3.2 | 898.375 ms | 1144.754 ms |
| Safari (M1 Mac) | 1.3.1 | 596.880 ms | 866.779 ms |
| Safari (M1 Mac) | 1.3.2 | 905.639 ms | 1168.120 ms |
| Firefox (M1 Mac) | 1.3.1 | 540.199 ms | 800.540 ms |
| Firefox (M1 Mac) | 1.3.2 | 609.940 ms | 1064.080 ms |
| Chrome (Android) | 1.3.1 | 1002.769 ms | 1040.140 ms |
| Chrome (Android) | 1.3.2 | 1398.920 ms | 1621.900 ms |

I don't know the reason for this performance drop.
Any information that helps alleviate this problem is more than welcome.

Hi @sile, thank you for benchmarking this on WASM (and for the nice benchmarking tool). Indeed, this is unexpected.
There are a few possible causes of the drop in performance on WASM:

  • The upgrade to TensorFlow 2.11 may have caused some issues for WASM. We saw a speed increase when running on Android natively (and an even larger speed increase using the TF repo's head), and the logs showed that XNNPACK delegated more of the graph. In particular, with the TF 2.11 upgrade we see `VERBOSE: Replacing 126 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 33 partitions.`, up from 94 nodes / 15 partitions with TF 2.9. It would be good to check the logs to see whether WASM shows similar node/partition counts, to make sure the accelerated XNNPACK path is being used (see the sketch after this list).
  • There is a possibility that the `TFLITE_XNNPACK_DELEGATE_FLAG_QU8` flag in `tflite_model_wrapper.cc` caused an issue.
  • I see that you benchmark on 500 frames. We now test on a longer sample, encoding 10000 20 ms frames, and we do see the initial ~30 frames going significantly slower, perhaps due to initialization/caching (we haven't looked deeply into this yet). It might be good to see if this has an effect, though I don't expect it to explain the full difference.
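
To illustrate the delegate check from the first point, here is a rough standalone sketch (plain TFLite C++ with an illustrative model path, not our actual wrapper code):

```cpp
#include <cstdio>
#include <memory>

#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  // Load a TFLite model (the path here is illustrative).
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  if (model == nullptr) return 1;

  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  if (tflite::InterpreterBuilder(*model, resolver)(&interpreter) != kTfLiteOk) {
    return 1;
  }

  // Create the XNNPACK delegate with default options and apply it.
  TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
  TfLiteDelegate* delegate = TfLiteXNNPackDelegateCreate(&options);

  // On success, TFLite logs a line like
  //   "Replacing N node(s) with delegate (TfLiteXNNPackDelegate) node,
  //    yielding M partitions."
  // Comparing N/M between the 1.3.1 and 1.3.2 builds shows whether the
  // accelerated XNNPACK path is used to the same extent.
  if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
    std::fprintf(stderr, "XNNPACK delegation failed\n");
  }

  interpreter.reset();  // destroy the interpreter before the delegate
  TfLiteXNNPackDelegateDelete(delegate);
  return 0;
}
```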

We will continue to look into this, but we first need to get set up to benchmark WASM ourselves. Feel free to play with the above settings in the meantime. Let me know if you have any questions or want to chat about this.

Thank you for your reply!
I am off for a while, so I will look at the details when I am back at work.

Hi @mchinen, thank you again for the detailed advice. I tried some of your suggestions, so let me share the results.

> There is a possibility that the `TFLITE_XNNPACK_DELEGATE_FLAG_QU8` flag in `tflite_model_wrapper.cc` caused an issue.

This diff seems to have a huge impact on the performance degradation.
I tried reverting this change, and the encode/decode times of the patched v1.3.2 became comparable to v1.3.1 (see the table below).

| Browser | Lyra Version | Encode Time | Decode Time |
| --- | --- | --- | --- |
| Chrome (M1 Mac) | 1.3.1 | 570.260 ms | 821.094 ms |
| Chrome (M1 Mac) | 1.3.2 | 918.710 ms | 1160.914 ms |
| Chrome (M1 Mac) | 1.3.2 (patched) | 569.240 ms | 829.614 ms |
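
For reference, my revert amounts to roughly the following (an approximation, not the exact Lyra diff; `MakeXnnpackDelegate` is a made-up helper for illustration):

```cpp
#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"

// Builds the XNNPACK delegate with or without the QU8 flag. v1.3.2 sets the
// flag; my patch simply drops it, which restored v1.3.1-level WASM speed.
TfLiteDelegate* MakeXnnpackDelegate(bool enable_qu8) {
  TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
  if (enable_qu8) {
    // Opts XNNPACK into quantized-uint8 inference (present in v1.3.2).
    options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_QU8;
  }
  return TfLiteXNNPackDelegateCreate(&options);
}
```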

> I see that you benchmark on 500 frames. We now test on a longer sample, encoding 10000 20 ms frames,

I ran the benchmark with ITERATIONS set to 10000. The performance degradation was still present, just as with ITERATIONS=500, as shown in the following table.

| Browser | Lyra Version | Encode Time | Decode Time |
| --- | --- | --- | --- |
| Chrome (M1 Mac) | 1.3.1 | 11276.710 ms | 16442.069 ms |
| Chrome (M1 Mac) | 1.3.2 | 18254.064 ms | 23232.560 ms |
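
As an aside, here is a minimal sketch of how the initial warm-up frames could be kept out of the timed region, per your suggestion (C++ for illustration, since the actual benchmark is JavaScript; `EncodeFrame` is a hypothetical stand-in for the Lyra encode call):

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for encoding one 20 ms frame; in the real
// benchmark this would be the Lyra WASM encode call.
static void EncodeFrame() { /* ... */ }

int main() {
  constexpr int kWarmupFrames = 30;   // initial frames reported to run slower
  constexpr int kIterations = 10000;  // matches ITERATIONS above

  // Run the warm-up frames outside the timed region.
  for (int i = 0; i < kWarmupFrames; ++i) EncodeFrame();

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIterations; ++i) EncodeFrame();
  const std::chrono::duration<double, std::milli> elapsed =
      std::chrono::steady_clock::now() - start;

  std::printf("total: %.3f ms, per frame: %.4f ms\n",
              elapsed.count(), elapsed.count() / kIterations);
  return 0;
}
```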

> We saw a speed increase when running on Android natively

Let me confirm: is 47698da#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5 the benchmark result you mentioned in the above comment?
If so, it seems the performance on Android has dropped slightly in v1.3.2. The following diff, quoted from the full diff between v1.3.1 and v1.3.2, says that decoding one frame takes 0.473 ms in v1.3.1 and 0.525 ms in v1.3.2 (I might be misunderstanding something, though).

```diff
  This shows that decoding a 50Hz frame (each frame is 20 milliseconds) takes
- 0.473 milliseconds on average. So decoding is performed at around 42 (20/0.473)
+ 0.525 milliseconds on average. So decoding is performed at around 38 (20/0.525)
  times faster than realtime.
```

Thanks @sile! I'm leaning toward reverting the flag change while we continue to look into it. The effect of the flag seems to depend largely on which version of TF we are using, and it differs for each platform.

Regarding the benchmark, I think it's a red herring. The new benchmark on native Android doesn't actually reflect a drop in speed due to that flag. Rather, the new numbers are slower because we did the earlier benchmarking on our internal version, which uses a different toolchain and a newer version of TF and happens to be slightly faster; that setup is not appropriate for our open source users. Hope that clears things up!

Makes sense, thanks!
I'm looking forward to seeing a newer version that fixes the performance issue on WebAssembly.

How does it compare to Opus, though?