softwaremill / tapir

Rapid development of self-documenting APIs

Home Page: https://tapir.softwaremill.com

High latency for netty-loom POST processing

kciesielski opened this issue

The netty-loom backend shows noticeably worse latency distribution during the PostBytes performance test:

[image: hdr-post]

Let's run the test with profiling and look for possible improvements.
To run the server:
perfTests/runMain sttp.tapir.perf.apis.ServerRunner netty.loom.Tapir
To attach async-profiler:
asprof -e cpu,alloc,lock -f profile.jfr <PID>
To run the test (pick a concurrency level that runs decently on your machine):
perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -m PostBytes -u 128 -d 30
To generate a flamegraph with the async-profiler converter:
java -cp ./converter.jar jfr2flame ./profile.jfr flamegraph.html

@kciesielski what hardware/environment was this tested on?

@ghik Ryzen 9 5950X (32 threads) @ 3.4GHz, 64GB RAM, using JDK 21

Here's a summary of my attempts at reproducing this issue in a cloud environment, with client and server on separate machines.

Hardware:

  • Client: c5.2xlarge EC2 instance (8 cores, 32GB RAM)
  • Server: t3.xlarge EC2 instance (4 cores, 16GB RAM)

I compared netty-loom with netty-future at 1, 2, 4, 8, 16, 32, 64, and 128 concurrent users, running the benchmark for 60 seconds. I monitored CPU usage on both client and server to make sure that neither machine was saturated. The scenario with 128 concurrent users was the only one that came close to saturation. This is important, because latency measured on a system loaded to its limits does not represent a real-life scenario and is likely to yield significantly worse numbers.
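To illustrate why latency measured near saturation is not representative, here is a minimal sketch based on the textbook M/M/1 queueing model, where mean time in the system is S / (1 - rho) for service time S and utilization rho. The model, the 1 ms service time, and the function name `mm1_mean_latency_ms` are assumptions made for illustration only; they do not describe tapir, Netty, or the perf test harness.

```python
def mm1_mean_latency_ms(service_time_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: S / (1 - rho).

    Assumption: a single server with Poisson arrivals and exponential
    service times. Real servers differ, but the blow-up near rho = 1
    is qualitatively the same.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical 1 ms service time; only the trend matters here.
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        latency = mm1_mean_latency_ms(1.0, rho)
        print(f"utilization {rho:.2f}: mean latency {latency:.1f} ms")
```

In this model, latency grows without bound as utilization approaches 1, so a small change in load on a nearly saturated machine can dominate any difference between two backends.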

Overall, I was unable to reproduce the tail latency problems. From my perspective, netty-loom and netty-future show comparable metrics, and any differences are likely to be just noise. Some runs randomly showed unusually high tail latency, e.g. loom.128 in the following run:

[image]

However, upon running the same test again, this difference disappeared. Therefore, I think that any differences are likely caused by random external factors, such as unexpected, temporary additional load on the client or server machine, and should not be attributed to tapir or any of its backends.
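As a rough illustration of why a single run with high tail latency is weak evidence, the sketch below draws several simulated runs from one and the same latency distribution and compares how much the median and the 99.9th percentile vary between runs. Everything here is an assumption for illustration: the log-normal distribution, the 20,000 requests per run, and the helper names. None of it is measured data from these benchmarks.

```python
import math
import random
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def simulate_run(rng, requests=20_000):
    # Hypothetical latencies in ms, drawn from a log-normal distribution.
    return [rng.lognormvariate(0.0, 1.0) for _ in range(requests)]


def relative_spread(values):
    # (max - min) / mean: how far apart repeated runs are, relative to their size.
    return (max(values) - min(values)) / statistics.mean(values)


def compare_runs(runs=10, seed=42):
    """Return the run-to-run relative spread of p50 and of p99.9."""
    rng = random.Random(seed)
    p50s, p999s = [], []
    for _ in range(runs):
        latencies = simulate_run(rng)
        p50s.append(percentile(latencies, 50))
        p999s.append(percentile(latencies, 99.9))
    return relative_spread(p50s), relative_spread(p999s)


if __name__ == "__main__":
    p50_spread, p999_spread = compare_runs()
    print(f"run-to-run spread of p50:   {p50_spread:.1%}")
    print(f"run-to-run spread of p99.9: {p999_spread:.1%}")
```

Because the 99.9th percentile depends on only a handful of the slowest requests in each run, it fluctuates far more between identical runs than the median does, which is why repeating a test before attributing a tail-latency difference to a backend is a sensible precaution.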

Thanks a lot for this effort @ghik, let's close the issue then.