High latency and resource consumption due to retry requests
Zelldon opened this issue
Please see the original issue for more details: camunda/zeebe#2778
Expected behavior
Under high load, there should be no unusual latency pattern.
Actual behavior
We saw a latency pattern like the following in our application, and the root cause seems to be Atomix.
$ ghz -insecure --proto ./gateway-protocol/src/main/proto/gateway.proto --call gateway_protocol.Gateway.CreateWorkflowInstance -d '{"workflowKey": 2251799813685249}' -n 2000 localhost:26500
Summary:
Count: 2000
Total: 73683.84 ms
Slowest: 5075.27 ms
Fastest: 25.08 ms
Average: 1742.50 ms
Requests/sec: 27.14
Response time histogram:
25.077 [1] |
530.096 [1339] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1035.115 [0] |
1540.135 [0] |
2045.154 [0] |
2550.173 [0] |
3055.192 [0] |
3560.211 [0] |
4065.230 [0] |
4570.249 [0] |
5075.268 [660] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Latency distribution:
10% in 60.69 ms
25% in 116.77 ms
50% in 157.56 ms
75% in 5019.84 ms
90% in 5039.01 ms
95% in 5048.32 ms
99% in 5061.54 ms
Status code distribution:
[OK] 2000 responses
We wrote a test example, and it often seems to block (a kind of deadlock) and mostly ends in an out-of-memory error.
Minimal yet complete reproducer code (or URL to code)
package io.atomix.protocols.raft.test;
import io.atomix.cluster.MemberId;
import io.atomix.cluster.Node;
import io.atomix.cluster.discovery.BootstrapDiscoveryProvider;
import io.atomix.core.Atomix;
import io.atomix.core.list.AsyncDistributedList;
import io.atomix.protocols.raft.partition.RaftPartitionGroup;
import io.atomix.storage.StorageLevel;
import java.io.File;
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
public class LatencyTest {
public static void main(String[] args) throws InterruptedException {
new LatencyTest().run();
}
private final MemberId member = MemberId.anonymous();
public void run() throws InterruptedException {
// management group with a single partition, used for Atomix system primitives
final RaftPartitionGroup system =
RaftPartitionGroup.builder("system")
.withMembers(member)
.withNumPartitions(1)
.withPartitionSize(1)
.withDataDirectory(new File(String.format("target/perf-logs/%s/system", member.id())))
.build();
// data group persisted to disk, without flushing on every commit
final RaftPartitionGroup data =
RaftPartitionGroup.builder("data")
.withMembers(member)
.withNumPartitions(1)
.withPartitionSize(1)
.withStorageLevel(StorageLevel.DISK)
.withFlushOnCommit(false)
.withDataDirectory(new File(String.format("target/perf-logs/%s/data", member.id())))
.build();
// single-node cluster that bootstraps with only the local member
final Atomix atomix =
Atomix.builder()
.withMemberId(member)
.withShutdownHookEnabled()
.withMembershipProvider(
BootstrapDiscoveryProvider.builder()
.withNodes(Node.builder().withId(member.id()).build())
.build())
.withManagementGroup(system)
.withPartitionGroups(data)
.build();
atomix.start().join();
final AsyncDistributedList<Integer> list = atomix.<Integer>getList("list").async();
// warmup
System.out.println("Warming up");
final int warmUpCount = 1000;
final CountDownLatch warmUpLatch = new CountDownLatch(warmUpCount);
for (int i = 0; i < warmUpCount; i++) {
list.add(1).thenRun(warmUpLatch::countDown);
}
warmUpLatch.await();
list.clear().join();
System.out.println("Finished warming up");
// issue 15_000 asynchronous inserts and record the latency of each
final int workCount = 15_000;
final long[] latencies = new long[workCount];
final CountDownLatch latch = new CountDownLatch(workCount);
for (int i = 0; i < workCount; i++) {
final Instant now = Instant.now();
final int index = i;
list.add(i, i)
.whenComplete(
(nothing, error) -> {
final Duration between = Duration.between(now, Instant.now());
latencies[index] = between.toMillis();
// if (between.compareTo(Duration.ofSeconds(5)) >= 0) {
//   System.out.println("one request took at least five seconds: " + between);
// }
if (index % 100 == 0) {
System.out.println("Counted down " + index);
}
latch.countDown();
});
}
latch.await();
// group latencies into one-second buckets; everything of 10 s or more falls into the last bucket
final int[] buckets = new int[10];
for (int i = 0; i < workCount; i++) {
if (latencies[i] < 10_000) {
buckets[(int)Math.floorDiv(latencies[i], 1000)]++;
} else {
buckets[9]++;
}
}
System.out.println("Finished, latencies grouped in order: " + Arrays.toString(buckets));
}
}
Environment
- Atomix: 3.1.5
- OS: Linux zell-arch 5.1.16-arch1-1-ARCH #1 SMP PREEMPT Wed Jul 3 20:23:07 UTC 2019 x86_64 GNU/Linux
- JVM: 8
Copied from camunda/zeebe#2778:
Problem
It can happen that we miss a command because of #1044.
In LeaderRole#onCommand we apply the commands and recognize if one is missing. As an example, say we miss a command at position X. If we see that we missed one, we buffer all commands that came after the missed position X. After we have collected 1000 commands, on the 1000+1 command we send a COMMAND_FAILURE reply back to the RaftSessionInvoker. The reply also contains the last successfully applied position X-1. The COMMAND_FAILURE signals that the server received events out of order.
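For illustration, here is a minimal sketch of this detection and buffering scheme, assuming hypothetical names (pending, LIMIT); it is not the actual LeaderRole implementation:

// Hypothetical sketch of out-of-order command detection, modeled on the
// behavior described above; not the actual Atomix LeaderRole code.
import java.util.TreeMap;

class CommandBuffer {
  static final int LIMIT = 1000;              // commands buffered before failing

  long nextExpectedPosition;                  // X: the position we are waiting for
  final TreeMap<Long, byte[]> pending = new TreeMap<>();

  /** Returns true if the command was accepted, false if a COMMAND_FAILURE was sent. */
  boolean onCommand(long position, byte[] command) {
    if (position == nextExpectedPosition) {
      apply(command);
      nextExpectedPosition++;
      // drain any buffered successors that are now in order
      while (pending.containsKey(nextExpectedPosition)) {
        apply(pending.remove(nextExpectedPosition));
        nextExpectedPosition++;
      }
      return true;
    }
    // position > nextExpectedPosition: a command was missed, buffer this one
    pending.put(position, command);
    if (pending.size() > LIMIT) {
      // signal the client: the last successfully applied position is X-1
      sendCommandFailure(nextExpectedPosition - 1);
      return false;
    }
    return true;
  }

  void apply(byte[] command) { /* apply to the state machine */ }
  void sendCommandFailure(long lastApplied) { /* reply to the RaftSessionInvoker */ }
}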
This triggers the following (see the sketch after this list):
- The RaftSessionInvoker calls resubmit(X-1, attempt), where attempt is the command that received the COMMAND_FAILURE response.
- In resubmit, all operations with positions larger than X-1 are retried; this includes X.
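To show why this is dangerous, here is a rough sketch of an asynchronous resubmit loop in the spirit of the description above (the names inFlight and executor are hypothetical; the real RaftSessionInvoker differs):

// Hypothetical sketch of asynchronous resubmission: every COMMAND_FAILURE
// re-schedules the whole tail of in-flight operations, and nothing marks
// an operation as "already retried".
import java.util.NavigableMap;
import java.util.concurrent.Executor;

class AsyncResubmitSketch {
  Executor executor;                       // the session's executor
  NavigableMap<Long, Runnable> inFlight;   // pending operations keyed by position

  void resubmit(long lastApplied, Runnable failedAttempt) {
    // retry every operation with a position greater than X-1 (lastApplied)
    for (Runnable attempt : inFlight.tailMap(lastApplied, false).values()) {
      executor.execute(attempt);           // scheduled asynchronously, not sent inline
    }
  }
}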
The problem with this is the following:
Say we have sent 1000+n commands after the missing command, where n is larger than 1.
Then on the 1000+1 command and on every following command up to 1000+n we get a COMMAND_FAILURE response, i.e. n COMMAND_FAILURE responses in total. On each COMMAND_FAILURE response we retry 1000+n operations, so we end up sending n * (1000 + n) commands. This can cause a lot of traffic and can end in an out-of-memory error.
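For a concrete feel of the blow-up, a quick back-of-the-envelope calculation (the values of n are made up for illustration):

// Worked example of the retry amplification: n COMMAND_FAILURE responses,
// each of which resends the whole tail of 1000 + n commands.
public class RetryAmplification {
  public static void main(String[] args) {
    final int buffered = 1000;                   // commands buffered before the first failure
    for (int n : new int[] {10, 100, 1000}) {    // illustrative values of n
      final long resent = (long) n * (buffered + n);
      System.out.printf("n = %4d -> %,d commands resent%n", n, resent);
    }
    // n =   10 -> 10,100 commands resent
    // n =  100 -> 110,000 commands resent
    // n = 1000 -> 2,000,000 commands resent
  }
}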
Fix
The main problem is that OperationAttempt#retry retries the commands asynchronously, which causes a lot of traffic in the end because we are not able to detect whether we have already retried an operation. It is not necessary to send these commands asynchronously: in the RaftSessionInvoker we only have one thread. If we send the retry directly in resubmit, we can check on the next iteration whether the operation was already retried or not.
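A minimal sketch of this idea, assuming a hypothetical lastRetriedPosition high-water mark (the actual OperationAttempt/RaftSessionInvoker code differs):

// Hypothetical sketch of the synchronous resubmit proposed above: since
// resubmit runs on the invoker's single thread, a simple high-water mark
// suffices to skip operations already resent by an earlier COMMAND_FAILURE.
import java.util.Map;
import java.util.NavigableMap;

class SyncResubmitSketch {
  NavigableMap<Long, Runnable> inFlight;   // pending operations keyed by position
  long lastRetriedPosition = -1;           // highest position already resent

  void resubmit(long lastApplied, Runnable failedAttempt) {
    for (Map.Entry<Long, Runnable> entry : inFlight.tailMap(lastApplied, false).entrySet()) {
      if (entry.getKey() <= lastRetriedPosition) {
        continue;                          // already resent; avoid duplicate traffic
      }
      entry.getValue().run();              // send directly instead of scheduling async
      lastRetriedPosition = entry.getKey();
    }
  }
}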
BTW:
I think it is not necessary to collect 1000 commands before we send the first COMMAND_FAILURE, which means we could send the reply directly, since all other commands are resent anyway.