High latency for netty-loom POST processing
kciesielski opened this issue
The netty-loom backend shows a noticeably worse latency distribution in the PostBytes performance test.

Let's run the test with profiling and look for possible improvements.
To run the server:

```
perfTests/runMain sttp.tapir.perf.apis.ServerRunner netty.loom.Tapir
```

To attach async-profiler:

```
asprof -e cpu,alloc,lock -f profile.jfr <PID>
```

To run the test (pick a concurrency level that runs decently on your machine):

```
perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -m PostBytes -u 128 -d 30
```

To generate a flamegraph with the async-profiler converter:

```
java -cp ./converter.jar jfr2flame ./profile.jfr flamegraph.html
```
@kciesielski what hardware/environment was this tested on?
@ghik Ryzen 9 5950X (32 threads) @ 3.4 GHz, 64 GB RAM, using JDK 21
Here's a summary of my attempts at reproducing this issue in a cloud environment, with client and server on separate machines.
Hardware:
- Client: c5.2xlarge EC2 instance (8 vCPUs, 16GB RAM)
- Server: t3.xlarge EC2 instance (4 vCPUs, 16GB RAM)
I compared netty-loom with netty-future at 1, 2, 4, 8, 16, 32, 64, and 128 concurrent users, running each benchmark for 60 seconds. CPU usage was monitored on both the client and the server to ensure that neither machine was saturated; the 128-user scenario was the only one that came close. This matters because measuring the latency of a system loaded to its limit does not represent a real-life scenario and is likely to yield significantly worse numbers.
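For clarity, "tail latency" in this discussion means the high percentiles of the per-request latency distribution. A minimal, self-contained sketch of how such percentiles can be computed (this is not tapir or benchmark code; the sample latencies are made up to mimic a tail-latency pattern):

```scala
// Sketch: nearest-rank percentiles over recorded per-request latencies.
// The sample data below is invented for illustration only.
object TailLatency {
  // Nearest-rank percentile: smallest sample such that at least p% of
  // all samples are <= that value.
  def percentile(samples: Seq[Double], p: Double): Double = {
    require(samples.nonEmpty && p >= 0 && p <= 100)
    val sorted = samples.sorted
    val rank = math.ceil(p / 100.0 * sorted.size).toInt
    sorted(math.max(rank - 1, 0))
  }

  def main(args: Array[String]): Unit = {
    // Mostly fast requests plus a small fraction of slow outliers:
    // the median looks fine while the tail is two orders of magnitude worse.
    val latenciesMs = Seq.fill(980)(2.0) ++ Seq.fill(20)(150.0)
    println(f"p50   = ${percentile(latenciesMs, 50)}%.1f ms")
    println(f"p99   = ${percentile(latenciesMs, 99)}%.1f ms")
    println(f"p99.9 = ${percentile(latenciesMs, 99.9)}%.1f ms")
  }
}
```

With 2% of requests slow, p50 stays at 2.0 ms while p99 and p99.9 jump to 150.0 ms, which is why comparing only averages can hide exactly the kind of difference this issue is about.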
Overall, I was unable to reproduce the tail-latency problems. From my perspective, netty-loom and netty-future show comparable metrics, and any differences are likely just noise. Some runs randomly showed unusually high tail latency, e.g. loom.128 in one of the runs. However, upon running the same test again, that difference disappeared. I therefore think that any differences are likely caused by random external factors, such as unexpected, temporary additional load on the client or server machine, and should not be attributed to tapir or any of its backends.
Thanks a lot for this effort @ghik, let's close the issue then.