dacapobench / dacapobench

The DaCapo benchmark suite

Home Page:https://www.dacapobench.org/


Metered tail latency not stable in h2 due to slow running threads

kk2049 opened this issue · comments

Hi!

I noticed that the metered tail latency results of your h2 benchmark are relatively unstable. I assume the "metered" algorithm is meant to amplify the impact of queuing during GC pauses, making the metered tail latency significantly larger than the simple tail latency. In practice, however, we sometimes get a strangely low metered tail latency that approaches the simple tail latency.

===== DaCapo `simple tail latency`: 50% 311 usec, 90% 1592 usec, 99% 2383 usec, 99.9% 22069 usec, 99.99% 25663 usec, max 26932 usec, measured over 100000 events =====
===== DaCapo `metered tail latency`: 50% 334 usec, 90% 1746 usec, 99% 5732 usec, 99.9% 22083 usec, 99.99% 25663 usec, max 26932 usec, measured over 100000 events =====

Environment Setup

We are using the latest release, DaCapo 23.11-chopin, running on OpenJDK 11.0.2.

We use the default G1 garbage collector and set the heap size to 1354MB, which is twice the minimum heap size required to run H2 with G1. The full command line is as follows.

numactl -C0-7 jdk-11.0.2/bin/java -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2 -XX:InitiatingHeapOccupancyPercent=65 -Xmx1354m -Xms1354m -jar $DACAPO_PATH h2 -n 5 --latency-csv

The CPU used for this test is an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (a server-class CPU). We bind DaCapo to 8 physical cores, so hyper-threading is not a factor. All cores are set to performance mode to reduce the impact of dynamic frequency scaling.

We have also run this test on an up-to-date consumer-grade CPU (i7-13700); although the issue is triggered less often there, we observed the same problem.

Analysis

We analyzed one of the test cases with relatively low metered latency and extracted the timestamps using the --latency-csv parameter, resulting in the graphs below.

fig1_edited
fig2_edited

The x-axis in Figure 1 is the index of the request, and the y-axis shows timestamps. The blue line is the (simple) start time of each request, while the red line is the synthstart (a variable used in the metered-latency calculation).

Similarly, in Figure 2-1 the x-axis is the index of the request, and the y-axis shows the difference between the metered latency and the simple latency, i.e. the adjustment applied by the metered-latency algorithm.

You may notice a strange upward "tail" inside the red box in Figure 1. This distorts the calculation of synthstart in the metered-latency algorithm below, producing the odd metered latencies.

157     float[] sorted = Arrays.copyOf(txbegin, events);
158     Arrays.sort(sorted);
159     double len = sorted[sorted.length-1]-sorted[0];
160     double synthstart = 0;
161     for(int i = 0; i < events; i++) {
162         int pos = Arrays.binarySearch(sorted, txbegin[i]);
163         synthstart = sorted[0] + (len*(double) pos / (double) txbegin.length);
164         int actual = (int) ((txend[i] - txbegin[i])/1000);
165         int synth = (int) ((txend[i] - synthstart)/1000);
166         latency[i] = (synth > actual) ? synth : actual;
167     }

The metered-latency algorithm uses the start times of the very first and the very last request to generate a synthstart for every other request (L159 & L163), shown as the red line added to Figure 1. Because of the tail, most requests receive an overly large synthstart (usually larger than their actual end time). This yields a negative synth latency and triggers the fallback (L166) to simple latency, resulting in an overall smaller metered tail latency. We also plot synth (L165) minus actual (L164) in Figure 2-2, where each negative value represents a fallback to simple latency.

fig2_2_edited

(By the way, at the right-most end of Figure 2-1, the last request gets a non-zero delta, which is quite strange: the metered algorithm uses the last request to adjust the other timestamps, so the last request itself should not be changed. This might be due to an off-by-one mistake in L163, where a `- 1` is missing. Changing L163 to `synthstart = sorted[0] + (len * (double) pos / (double) (txbegin.length - 1))` might be better.)
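For concreteness, here is a sketch of the same loop with that change applied (variable names as in the snippet above; this is only a suggestion, not a committed fix):

float[] sorted = Arrays.copyOf(txbegin, events);
Arrays.sort(sorted);
double len = sorted[sorted.length - 1] - sorted[0];
double synthstart = 0;
for (int i = 0; i < events; i++) {
    int pos = Arrays.binarySearch(sorted, txbegin[i]);
    // dividing by (txbegin.length - 1) maps pos == txbegin.length - 1 exactly onto
    // sorted[sorted.length - 1], so the last request's synthstart is its own start
    // time and it receives no adjustment
    synthstart = sorted[0] + (len * (double) pos / (double) (txbegin.length - 1));
    int actual = (int) ((txend[i] - txbegin[i]) / 1000);
    int synth = (int) ((txend[i] - synthstart) / 1000);
    latency[i] = (synth > actual) ? synth : actual;
}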

By making some slight modifications to the h2 submitter, I printed the time each thread takes to finish its requests. I found that this "tail" is caused by some threads finishing slightly later than the others.

Thread 2 finishs using 7543818332ns
Thread 3 finishs using 7664569404ns
Thread 0 finishs using 7717729984ns
Thread 1 finishs using 7743813100ns
Thread 4 finishs using 7840525012ns
Thread 5 finishs using 7866121372ns
Thread 7 finishs using 7911626720ns
Thread 6 finishs using 7998111288ns

Once some of the workers have completed their tasks, the decrease in concurrency causes the slope in Figure 1 to rise. For example, when 4 of 8 workers finish early, the slope doubles, since the index on the x-axis advances half as fast as before while the timestamp on the y-axis advances as usual.
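To illustrate this effect, here is a minimal, self-contained simulation (hypothetical code, not part of DaCapo): half of 8 workers are given twice as much work, each request takes a constant amount of time, and the program prints the global request index against each request's start time. While all 8 workers are active the start time advances by costNs/8 per index; once the 4 lighter-loaded workers finish, it advances by costNs/4, i.e. the slope doubles, matching the "tail" in Figure 1.

// Hypothetical simulation of the slope effect described above.
public class SlopeDemo {
    public static void main(String[] args) {
        final int workers = 8;
        final long costNs = 1_000;              // assumed constant per-request cost
        int[] remaining = new int[workers];
        long[] nextStart = new long[workers];   // each worker's next start timestamp
        for (int w = 0; w < workers; w++) {
            // half of the workers get twice as many requests, so they finish later
            remaining[w] = (w < workers / 2) ? 1_000 : 2_000;
        }
        int index = 0;
        while (true) {
            // issue the globally earliest pending request (merging the per-worker timelines)
            int next = -1;
            for (int w = 0; w < workers; w++) {
                if (remaining[w] > 0 && (next < 0 || nextStart[w] < nextStart[next])) next = w;
            }
            if (next < 0) break;                                 // all workers have finished
            System.out.println(index + "," + nextStart[next]);   // index vs. txbegin
            nextStart[next] += costNs;                           // this worker is busy until then
            remaining[next]--;
            index++;
        }
    }
}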

I believe it is quite common for some workers to run slower, since all the request types are determined randomly. Additionally, multiple JVM threads run on the same cores alongside these mutators, which might also influence the mutators' performance.

Possible Fixes

  • Maybe use a centralized global counter for all worker threads instead of deciding their request number before the test starts. (I'm not sure whether this will lead to serious contention; a sketch of the idea is given after Figure 2-3 below.)

  • Let T1 be the completion time of the first thread to finish; ignoring every request whose start timestamp is larger than T1 when calculating the metered latency might fix this problem. We tried this method and generated Figure 2-3. It makes the situation slightly better: some positive triangles, each indicating a GC, can be seen in the right part of the plot. However, there are still many negative values in the plot, which I think are hard to eliminate.

fig2_3_edited
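Here is a rough sketch of the first suggestion above (hypothetical names, not DaCapo code): workers pull request slots from one shared counter rather than each being assigned a fixed quota before the run, so all threads stop at roughly the same time.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a single shared counter hands out request slots,
// so no worker is left running long after the others have finished.
class SharedWorkQueue {
    private final AtomicInteger issued = new AtomicInteger(0);
    private final int totalRequests;

    SharedWorkQueue(int totalRequests) {
        this.totalRequests = totalRequests;
    }

    // Returns the next global request index, or -1 when the work is exhausted.
    int next() {
        int i = issued.getAndIncrement();
        return (i < totalRequests) ? i : -1;
    }
}

// Each worker thread would then loop roughly like this:
//   int i;
//   while ((i = queue.next()) != -1) {
//       executeRandomTransaction(i);   // hypothetical per-request work
//   }

Whether the extra atomic increment per request causes measurable contention at these request rates would of course need to be measured.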

Discussion

By the way, we have developed a modified version of the DaCapo h2 benchmark, called h2-throttle, in order to obtain a detailed latency-throughput curve. It limits the request issue rate by constructing a static timeline before the test starts. h2-throttle also helps measure metered latency in a more direct way.

In h2-throttle, the pre-determined timeline decides when each request should be issued. Once a worker has finished its previous request, it checks the timeline and decides whether to wait or to start the next request immediately.

So after a worker has been blocked by a GC pause, it will "chase" the timeline, as if it were draining a queue of requests delayed by the GC. In h2-throttle, we can measure metered latency simply by subtracting the scheduled time on the timeline from the actual finish time.

This design simulates the following system: a client issues requests at a fixed rate into a pending queue, and the server does its best to handle them. The metered latency in our h2-throttle is the time between the client issue time and the server finish time (which includes queueing time).

fig3_edited
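A rough sketch of the per-worker loop in this design (hypothetical code illustrating the idea, not our actual h2-throttle implementation; executeTransaction and the timing parameters are placeholders):

// Hypothetical sketch of a throttled worker: requests are pre-scheduled on a
// fixed-rate timeline, and metered latency is measured against the scheduled
// issue time rather than the actual start time.
class ThrottledWorker implements Runnable {
    private final long startNs;           // common benchmark start time
    private final long intervalNs;        // fixed inter-request interval (the throttle)
    private final int requests;           // number of requests assigned to this worker
    private final long[] meteredLatencyNs;

    ThrottledWorker(long startNs, long intervalNs, int requests) {
        this.startNs = startNs;
        this.intervalNs = intervalNs;
        this.requests = requests;
        this.meteredLatencyNs = new long[requests];
    }

    public void run() {
        for (int i = 0; i < requests; i++) {
            long scheduledNs = startNs + (long) i * intervalNs;   // static timeline
            if (System.nanoTime() < scheduledNs) {
                // ahead of the timeline: wait until the scheduled issue time
                sleepUntil(scheduledNs);
            }
            // behind the timeline (e.g. after a GC pause): start immediately and "chase" it
            executeTransaction(i);                                // placeholder for one request
            long finishedNs = System.nanoTime();
            // metered latency = actual finish time minus scheduled issue time,
            // which includes the (virtual) queueing delay
            meteredLatencyNs[i] = finishedNs - scheduledNs;
        }
    }

    private static void sleepUntil(long deadlineNs) {
        long waitNs = deadlineNs - System.nanoTime();
        if (waitNs > 0) {
            try {
                Thread.sleep(waitNs / 1_000_000, (int) (waitNs % 1_000_000));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void executeTransaction(int i) { /* issue one benchmark request */ }
}

Sweeping intervalNs over different values gives different issue rates, which is how the latency-throughput curve in advantage 1 below is obtained.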

This modification has the following advantages:

  1. It allows developers to generate a detailed latency-throughput curve by setting different issue rates (throttles) and measuring the resulting metered latency.
  2. Each thread runs at a consistent speed, eliminating the previous issue of varying thread speeds.
  3. It produces a more stable metered tail latency, since most noise stops influencing the final result once a thread catches up with the timeline.
  4. It also gives a more precise measurement of GC's impact on tail latency. Both the width and the height of the "tail" in the latency-throughput curve are directly influenced by the GC pause time. In Figure 4-1 you can see the synth - actual graph (as defined for Figure 2-2) after we calculate metered latency in h2-throttle. Each GC leads to a positive triangle in the graph, indicating that GC's impact on metered tail latency.

fig4_1_edited

However, since these threads do not all start and end at the same moment, the concurrency at the head and tail of the run is below its maximum. This leads to the strange blue area near the x-axis. To deal with this, we cut off the head and tail portions when calculating synth, producing Figure 4-2, which seems more reasonable.

fig4_2_edited

If you find this modification interesting, please contact us! We would be more than happy to contribute it to the DaCapo community.

Thanks for raising this.

1. The off-by-one on L163 does look wrong. Thank you. I will investigate.

2. Aside from that off-by-one error you spotted, I think the metered latency is working as expected.

Maybe use a centralized global counter for all worker threads instead of deciding their request number before the test starts.

The code you're referring to is TPC-C, which is an industry-standard benchmark. I agree that this behavior means there will be a fall-off in throughput toward the end of the run as the number of active threads drops. However: a) TPC-C is a standard workload, so I would rather not deviate from it in some non-standard way, and b) the effect you're concerned with shrinks as the execution time of the workload increases (which you can achieve by using the large or huge workload sizes).

3. Can you please create a separate issue for your proposal to extend h2? It is unrelated, and it's best to use a separate GitHub issue for each separate concern.

Thanks again for raising this.

I fixed the calculation of metered latency in #267.

This very minor change adjusts the period over which synthetic starts are calculated, which will have a negligible observable effect on the reported metered latency. As far as I am aware, the implementation of metered latency is now correct and behaving as intended.

I have looked further into your questions about TPC-C.

The TPC-C spec models separate "terminals", each of which executes a particular workload independently, following prescribed probabilities for the various operations over the shared database. The existing implementation, which fixes the terminal workloads before the benchmark starts, is consistent with the specification. This results in different "terminals" (workers) finishing at different times, depending on exactly which work they perform and on various other factors such as scheduling. This is expected behavior.

The problem that you highlight is that different workers finish at different times.

This is generally true of the parallelization of any workload. Even if work were globally issued (as you suggest), different threads would still finish at different times due to other factors, including operating-system scheduling, the relative time taken to complete the randomly assigned work items, etc. Thus the problem you observe is one that will always exist to some extent and is unsurprising. The simple mitigation is to run the workload for longer so that the impact of the problem is amortized.

In the case of h2, the simple and metered latencies become very similar (or even identical) on later iterations and/or with larger heaps, since the memory profile of the workload means that, once warmed up, it may perform very little (or even no) garbage collection work in a given iteration.

For the above reasons, I do not plan to change the implementation of TPC-C in the h2 benchmark.

As mentioned earlier, if you have a separate proposal to throttle h2, please make this proposal in a separate issue.