High latency and resource consumption due to retry requests
Zelldon opened this issue
Please see the original issue for more details: camunda/zeebe#2778
Expected behavior
Under high load, there should be no unusual latency pattern.
Actual behavior
We saw a latency pattern like the following in our application, and the root cause seems to be Atomix.
$ ghz -insecure --proto ./gateway-protocol/src/main/proto/gateway.proto --call gateway_protocol.Gateway.CreateWorkflowInstance -d '{"workflowKey": 2251799813685249}' -n 2000 localhost:26500
Summary:
Count: 2000
Total: 73683.84 ms
Slowest: 5075.27 ms
Fastest: 25.08 ms
Average: 1742.50 ms
Requests/sec: 27.14
Response time histogram:
25.077 [1] |
530.096 [1339] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1035.115 [0] |
1540.135 [0] |
2045.154 [0] |
2550.173 [0] |
3055.192 [0] |
3560.211 [0] |
4065.230 [0] |
4570.249 [0] |
5075.268 [660] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Latency distribution:
10% in 60.69 ms
25% in 116.77 ms
50% in 157.56 ms
75% in 5019.84 ms
90% in 5039.01 ms
95% in 5048.32 ms
99% in 5061.54 ms
Status code distribution:
[OK] 2000 responses
We wrote a test example, and it often seems to block (a kind of deadlock) and mostly ends in an out-of-memory error.
Minimal yet complete reproducer code (or URL to code)
package io.atomix.protocols.raft.test;
import io.atomix.cluster.MemberId;
import io.atomix.cluster.Node;
import io.atomix.cluster.discovery.BootstrapDiscoveryProvider;
import io.atomix.core.Atomix;
import io.atomix.core.list.AsyncDistributedList;
import io.atomix.protocols.raft.partition.RaftPartitionGroup;
import io.atomix.storage.StorageLevel;
import java.io.File;
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
public class LatencyTest {
public static void main(String[] args) throws InterruptedException {
new LatencyTest().run();
}
private final MemberId member = MemberId.anonymous();
public void run() throws InterruptedException {
// management group with a single partition, used for Atomix system primitives
final RaftPartitionGroup system =
RaftPartitionGroup.builder("system")
.withMembers(member)
.withNumPartitions(1)
.withPartitionSize(1)
.withDataDirectory(new File(String.format("target/perf-logs/%s/system", member.id())))
.build();
// data group persisted to disk, without flushing on every commit
final RaftPartitionGroup data =
RaftPartitionGroup.builder("data")
.withMembers(member)
.withNumPartitions(1)
.withPartitionSize(1)
.withStorageLevel(StorageLevel.DISK)
.withFlushOnCommit(false)
.withDataDirectory(new File(String.format("target/perf-logs/%s/data", member.id())))
.build();
// single-node cluster that bootstraps with only the local member
final Atomix atomix =
Atomix.builder()
.withMemberId(member)
.withShutdownHookEnabled()
.withMembershipProvider(
BootstrapDiscoveryProvider.builder()
.withNodes(Node.builder().withId(member.id()).build())
.build())
.withManagementGroup(system)
.withPartitionGroups(data)
.build();
atomix.start().join();
final AsyncDistributedList<Integer> list = atomix.<Integer>getList("list").async();
// warmup
System.out.println("Warming up");
final int warmUpCount = 1000;
final CountDownLatch warmUpLatch = new CountDownLatch(warmUpCount);
for (int i = 0; i < warmUpCount; i++) {
list.add(1).thenRun(warmUpLatch::countDown);
}
warmUpLatch.await();
list.clear().join();
System.out.println("Finished warming up");
// issue 15_000 asynchronous inserts and record the latency of each
final int workCount = 15_000;
final long[] latencies = new long[workCount];
final CountDownLatch latch = new CountDownLatch(workCount);
for (int i = 0; i < workCount; i++) {
final Instant now = Instant.now();
final int index = i;
list.add(i, i)
.whenComplete(
(nothing, error) -> {
final Duration between = Duration.between(now, Instant.now());
latencies[index] = between.toMillis();
// if (between.compareTo(Duration.ofSeconds(5)) >= 0) {
//   System.out.println("one request took at least five seconds: " + between);
// }
if (index % 100 == 0) {
System.out.println("Counted down " + index);
}
latch.countDown();
});
}
latch.await();
// group latencies into one-second buckets; everything of 10 s or more falls into the last bucket
final int[] buckets = new int[10];
for (int i = 0; i < workCount; i++) {
if (latencies[i] < 10_000) {
buckets[(int)Math.floorDiv(latencies[i], 1000)]++;
} else {
buckets[9]++;
}
}
System.out.println("Finished, latencies grouped in order: " + Arrays.toString(buckets));
}
}
Environment
- Atomix: 3.1.5
- OS: Linux zell-arch 5.1.16-arch1-1-ARCH #1 SMP PREEMPT Wed Jul 3 20:23:07 UTC 2019 x86_64 GNU/Linux
- JVM: 8
Copied from camunda/zeebe#2778:
Problem
It can happen that we miss a command because of #1044.
In LeaderRole#onCommand we apply the commands and recognize if one is missing. As an example, say we miss a command at position X. If we see that we missed one, we buffer all commands that came after the missed position X. After we have collected 1000 commands, on the 1000+1 command we send a COMMAND_FAILURE reply back to the RaftSessionInvoker. The reply also contains the last successfully applied position X-1. The COMMAND_FAILURE signals that the server received events out of order.
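For illustration, here is a minimal sketch of this detection and buffering scheme, assuming hypothetical names (pending, LIMIT); it is not the actual LeaderRole implementation:

// Hypothetical sketch of out-of-order command detection, modeled on the
// behavior described above; not the actual Atomix LeaderRole code.
import java.util.TreeMap;

class CommandBuffer {
  static final int LIMIT = 1000;              // commands buffered before failing

  long nextExpectedPosition;                  // X: the position we are waiting for
  final TreeMap<Long, byte[]> pending = new TreeMap<>();

  /** Returns true if the command was accepted, false if a COMMAND_FAILURE was sent. */
  boolean onCommand(long position, byte[] command) {
    if (position == nextExpectedPosition) {
      apply(command);
      nextExpectedPosition++;
      // drain any buffered successors that are now in order
      while (pending.containsKey(nextExpectedPosition)) {
        apply(pending.remove(nextExpectedPosition));
        nextExpectedPosition++;
      }
      return true;
    }
    // position > nextExpectedPosition: a command was missed, buffer this one
    pending.put(position, command);
    if (pending.size() > LIMIT) {
      // signal the client: the last successfully applied position is X-1
      sendCommandFailure(nextExpectedPosition - 1);
      return false;
    }
    return true;
  }

  void apply(byte[] command) { /* apply to the state machine */ }
  void sendCommandFailure(long lastApplied) { /* reply to the RaftSessionInvoker */ }
}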
This triggers the following (see the sketch after this list):
- The RaftSessionInvoker calls resubmit(X-1, attempt), where attempt is the command that received the COMMAND_FAILURE response.
- In resubmit, all operations with positions larger than X-1 are retried; this includes X.
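To show why this is dangerous, here is a rough sketch of an asynchronous resubmit loop in the spirit of the description above (the names inFlight and executor are hypothetical; the real RaftSessionInvoker differs):

// Hypothetical sketch of asynchronous resubmission: every COMMAND_FAILURE
// re-schedules the whole tail of in-flight operations, and nothing marks
// an operation as "already retried".
import java.util.NavigableMap;
import java.util.concurrent.Executor;

class AsyncResubmitSketch {
  Executor executor;                       // the session's executor
  NavigableMap<Long, Runnable> inFlight;   // pending operations keyed by position

  void resubmit(long lastApplied, Runnable failedAttempt) {
    // retry every operation with a position greater than X-1 (lastApplied)
    for (Runnable attempt : inFlight.tailMap(lastApplied, false).values()) {
      executor.execute(attempt);           // scheduled asynchronously, not sent inline
    }
  }
}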
The problem with this is the following:
Say we have sent 1000+n commands after the missing command, where n is larger than 1.
Then on the 1000+1 command and on every following command up to 1000+n we get a COMMAND_FAILURE response, i.e. n COMMAND_FAILURE responses in total. On each COMMAND_FAILURE response we retry 1000+n operations, so we end up sending n * (1000 + n) commands. This can cause a lot of traffic and can end in an out-of-memory error.
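For a concrete feel of the blow-up, a quick back-of-the-envelope calculation (the values of n are made up for illustration):

// Worked example of the retry amplification: n COMMAND_FAILURE responses,
// each of which resends the whole tail of 1000 + n commands.
public class RetryAmplification {
  public static void main(String[] args) {
    final int buffered = 1000;                   // commands buffered before the first failure
    for (int n : new int[] {10, 100, 1000}) {    // illustrative values of n
      final long resent = (long) n * (buffered + n);
      System.out.printf("n = %4d -> %,d commands resent%n", n, resent);
    }
    // n =   10 -> 10,100 commands resent
    // n =  100 -> 110,000 commands resent
    // n = 1000 -> 2,000,000 commands resent
  }
}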
Fix
The main problem is that OperationAttempt#retry retries the commands asynchronously, which causes a lot of traffic in the end because we are not able to detect whether we have already retried an operation. It is not necessary to send these commands asynchronously: in the RaftSessionInvoker we only have one thread. If we send the retry directly in resubmit, we can check on the next iteration whether the operation was already retried or not.
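A minimal sketch of this idea, assuming a hypothetical lastRetriedPosition high-water mark (the actual OperationAttempt/RaftSessionInvoker code differs):

// Hypothetical sketch of the synchronous resubmit proposed above: since
// resubmit runs on the invoker's single thread, a simple high-water mark
// suffices to skip operations already resent by an earlier COMMAND_FAILURE.
import java.util.Map;
import java.util.NavigableMap;

class SyncResubmitSketch {
  NavigableMap<Long, Runnable> inFlight;   // pending operations keyed by position
  long lastRetriedPosition = -1;           // highest position already resent

  void resubmit(long lastApplied, Runnable failedAttempt) {
    for (Map.Entry<Long, Runnable> entry : inFlight.tailMap(lastApplied, false).entrySet()) {
      if (entry.getKey() <= lastRetriedPosition) {
        continue;                          // already resent; avoid duplicate traffic
      }
      entry.getValue().run();              // send directly instead of scheduling async
      lastRetriedPosition = entry.getKey();
    }
  }
}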
BTW:
I think it is not necessary to collect 1000 commands before we send the first COMMAND_FAILURE, which means we could send the reply directly, since all other commands are resent anyway.