High latency for netty-loom POST processing

kciesielski opened this issue 3 months ago

Krzysztof Ciesielski commented 3 months ago

The netty-loom backend shows a noticeably worse latency distribution during the PostBytes performance test.

Let's run the test with profiling and look for possible improvements.
To run the server:
perfTests/runMain sttp.tapir.perf.apis.ServerRunner netty.loom.Tapir
To attach async-profiler:
asprof -e cpu,alloc,lock -f profile.jfr <PID>
To run the test (pick a concurrency level that runs decently on your machine):
perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -m PostBytes -u 128 -d 30
To generate a flamegraph with the async-profiler converter:
java -cp ./converter.jar jfr2flame ./profile.jfr flamegraph.html

Roman Janusz · Answer 1 · Wed Mar 27 2024 19:39:46 GMT+0800 (China Standard Time)

@kciesielski what hardware/environment was this tested on?

Krzysztof Ciesielski · Answer 2 · Thu Mar 28 2024 15:21:22 GMT+0800 (China Standard Time)

@ghik Ryzen 9 5950X (32) @3.4GHz, 64GB RAM, using JDK 21

Roman Janusz · Answer 3 · Thu Apr 04 2024 22:38:25 GMT+0800 (China Standard Time)

Here's a summary of my attempts at reproducing this issue in a cloud environment, with the client and server on separate machines.

Hardware:

Client: c5.2xlarge EC2 instance (8 cores, 32GB RAM)
Server: t3.xlarge EC2 instance (4 cores, 16GB RAM)

I compared netty-loom with netty-future at 1, 2, 4, 8, 16, 32, 64, and 128 concurrent users, running the benchmark for 60 seconds each time. I monitored CPU usage on both client and server to make sure that neither machine was saturated. The scenario with 128 concurrent users was the only one that came close. This is important, because measuring the latency of a system loaded to its limits does not represent a real-life scenario and is likely to yield significantly worse numbers.

Overall, I was unable to reproduce the tail latency problems. From my perspective, netty-loom and netty-future show comparable metrics, and any differences are likely to be just noise. Some runs randomly showed unusually high tail latency, e.g. loom.128 in one of the runs.

However, when I ran the same test again, the difference disappeared. I therefore think that any differences are likely caused by random external factors, such as unexpected, temporary additional load on the client or server machine, and should not be attributed to tapir or any of its backends.
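
For anyone who wants to check a one-off spike like this more systematically, here is a minimal Python sketch. It assumes you can export per-request latencies (in milliseconds) for each run as a plain list. The function names, the nearest-rank percentile definition, and the 10% tolerance are my own illustrative choices and are not part of tapir's perfTests. The idea is to compare the median of the per-run p99 values, so that a single noisy run does not decide the outcome.

```python
import math
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def tail_is_consistently_worse(candidate_runs, baseline_runs, q=99, tolerance=1.10):
    """Return True only if the median of the candidate's per-run percentiles
    exceeds the baseline's median by more than the (hypothetical) tolerance."""
    candidate = statistics.median([percentile(run, q) for run in candidate_runs])
    baseline = statistics.median([percentile(run, q) for run in baseline_runs])
    return candidate > baseline * tolerance


# Made-up numbers for illustration only, not measurements from this issue.
normal_run = [10] * 99 + [20]
spiky_run = [10] * 98 + [200, 200]

# One spiky run out of three does not count as a regression.
one_off = tail_is_consistently_worse(
    [spiky_run, normal_run, normal_run], [normal_run] * 3
)
# A spike in every run does.
persistent = tail_is_consistently_worse([spiky_run] * 3, [normal_run] * 3)
```

With these synthetic inputs, `one_off` is False and `persistent` is True, which matches the reasoning above: a difference that disappears on a rerun is treated as noise.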

Krzysztof Ciesielski · Answer 4 · Fri Apr 05 2024 14:58:10 GMT+0800 (China Standard Time)

Thanks a lot for this effort @ghik, let's close the issue then.