dacapobench / dacapobench

The DaCapo benchmark suite

Home Page: https://www.dacapobench.org/

RC2: Timing-sensitive kafka failure

steveblackburn opened this issue

Kafka will fail on its second iteration when run on certain hardware.

Specifics:

  • Regression introduced by 5c4d976 (no failures observed on parent commit)
  • It never fails on the first iteration, always on the second.
  • The bug is completely reproducible but seen only on certain hardware. It was tested across machines running the same image, so it appears to be genuinely hardware-specific, with a failure rate of either 0% or 100% on a given machine:
    • No problem observed: ANU Ryzen 9 7950X Zen 4, AMD Ryzen 9 5950X Zen 3, AMD Ryzen 9 3900X, Kaby Lake desktop machine, Google AMD
    • Problem observed for 100% of executions: ANU i7-6700k Skylake, ANU Ryzen 9 3900X Zen 2, Intel MacBook
  • Fails for all tested JVMs (OpenJDK and Temurin, 11, 17, 20).

It nondeterministically produces one of two failure modes (most frequently the silent failure).

Failure mode 1 (stack trace):

$ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -jar dacapo-evaluation-git-5c4d976c.jar -n 2 kafka
[...]
Using scaled threading model. 8 processors detected, 8 threads used to drive the workload, in a possible range of [1,unlimited]
Version: kafka 3.3.1
Nominal stats: AOA: 4|54, AOL: 4|48, AOM: 9|32, AOS: 3|16, ARA: 2|604, BAL: 2|1, BAS: 4|0, BEF: 8|3, BGF: 2|94, BPF: 2|25, BUB: 8|128, BUF: 8|21, GCA: 5|83, GCC: 10|153, GCM: 5|83, GCP: 1|0, GLK: 3|0, GMH: 9|203, GMU: 8|214, GSS: 1|0, GTO: 3|14, PET: 9|6, PPE: 2|3, PSD: 7|0, PWU: 8|1
Starting Zookeeper...
Starting Kafka Server...
===== DaCapo evaluation-git-5c4d976c kafka starting warmup 1 =====
Trogdor is running the workload....
Starting 1000000 requests...
100%
Completed requests
Finished
===== DaCapo evaluation-git-5c4d976c kafka completed warmup 1 in 12655 msec =====
===== DaCapo simple tail latency: 50% 24038 usec, 90% 135959 usec, 99% 388850 usec, 99.9% 409488 usec, 99.99% 411222 usec, max 413611 usec, measured over 1000000 events =====
===== DaCapo metered tail latency: 50% 266059 usec, 90% 412477 usec, 99% 439855 usec, 99.9% 442681 usec, 99.99% 444794 usec, max 445464 usec, measured over 1000000 events =====
===== DaCapo evaluation-git-5c4d976c kafka starting =====
Trogdor is running the workload....
Starting 1000000 requests...
Completed requests
Error while executing topic command : Topic 'dacapo-1,dacapo-2,dacapo-3,dacapo-4' does not exist as expected
[2023-09-09 05:50:19,149] ERROR java.lang.IllegalArgumentException: Topic 'dacapo-1,dacapo-2,dacapo-3,dacapo-4' does not exist as expected
	at kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:402)
	at kafka.admin.TopicCommand$TopicService.deleteTopic(TopicCommand.scala:362)
	at kafka.admin.TopicCommand$.main(TopicCommand.scala:63)
	at kafka.admin.TopicCommand.main(TopicCommand.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.dacapo.kafka.ClientRunner.runClient(Unknown Source)
	at org.dacapo.kafka.Launcher.performIteration(Unknown Source)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.dacapo.harness.Kafka.iterate(Kafka.java:53)
	at org.dacapo.harness.Benchmark.run(Benchmark.java:237)
	at org.dacapo.harness.TestHarness.runBenchmark(TestHarness.java:225)
	at org.dacapo.harness.TestHarness.main(TestHarness.java:170)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at Harness.main(Unknown Source)
 (kafka.admin.TopicCommand$)
Finished
Digest validation failed for stdout.log, expecting 0xd054f76402d83d91bc9824746d89c489c458a103 found 0xbf2efb88a8d9c3a0c9e515527a815193e350c40b
===== DaCapo evaluation-git-5c4d976c kafka FAILED =====

Failure mode 2 (silent):

$ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -jar dacapo-evaluation-git-5c4d976c.jar -n 2 kafka
[...]
Using scaled threading model. 8 processors detected, 8 threads used to drive the workload, in a possible range of [1,unlimited]
Version: kafka 3.3.1
Nominal stats: AOA: 4|54, AOL: 4|48, AOM: 9|32, AOS: 3|16, ARA: 2|604, BAL: 2|1, BAS: 4|0, BEF: 8|3, BGF: 2|94, BPF: 2|25, BUB: 8|128, BUF: 8|21, GCA: 5|83, GCC: 10|153, GCM: 5|83, GCP: 1|0, GLK: 3|0, GMH: 9|203, GMU: 8|214, GSS: 1|0, GTO: 3|14, PET: 9|6, PPE: 2|3, PSD: 7|0, PWU: 8|1
Starting Zookeeper...
Starting Kafka Server...
===== DaCapo evaluation-git-5c4d976c kafka starting warmup 1 =====
Trogdor is running the workload....
Starting 1000000 requests...
100%
Completed requests
Finished
===== DaCapo evaluation-git-5c4d976c kafka completed warmup 1 in 12861 msec =====
===== DaCapo simple tail latency: 50% 21646 usec, 90% 116156 usec, 99% 366515 usec, 99.9% 379513 usec, 99.99% 381642 usec, max 382073 usec, measured over 1000000 events =====
===== DaCapo metered tail latency: 50% 218681 usec, 90% 412230 usec, 99% 439030 usec, 99.9% 445226 usec, 99.99% 446288 usec, max 446702 usec, measured over 1000000 events =====
===== DaCapo evaluation-git-5c4d976c kafka starting =====
Trogdor is running the workload....
Starting 1000000 requests...
Completed requests
Finished
Digest validation failed for stdout.log, expecting 0xd054f76402d83d91bc9824746d89c489c458a103 found 0xda227bc97a9995dc07e939807ed6902ffbe12cb8
===== DaCapo evaluation-git-5c4d976c kafka FAILED =====
===== DaCapo simple tail latency: 50% 0 usec, 90% 0 usec, 99% 0 usec, 99.9% 0 usec, 99.99% 0 usec, max 0 usec, measured over 1000000 events =====
===== DaCapo metered tail latency: 50% 0 usec, 90% 0 usec, 99% 0 usec, 99.9% 0 usec, 99.99% 0 usec, max 0 usec, measured over 1000000 events =====
Validation FAILED for kafka default

Normal execution on same hardware, parent commit:

$ /usr/lib/jvm/temurin-11-jdk-amd64/bin/java -jar dacapo-evaluation-git-2a2db73e.jar -n 2 kafka
[...]
Using scaled threading model. 8 processors detected, 8 threads used to drive the workload, in a possible range of [1,unlimited]
Version: kafka 3.3.1
Nominal stats: AOA: 4|54, AOL: 4|48, AOM: 9|32, AOS: 3|16, ARA: 2|604, BAL: 2|1, BAS: 4|0, BEF: 8|3, BGF: 2|94, BPF: 2|25, BUB: 8|128, BUF: 8|21, GCA: 5|83, GCC: 10|153, GCM: 5|83, GCP: 1|0, GLK: 3|0, GMH: 9|203, GMU: 8|214, GSS: 1|0, GTO: 3|14, PET: 9|6, PPE: 2|3, PSD: 7|0, PWU: 8|1
Starting Zookeeper...
Starting Kafka Server...
Starting Agent...
Starting Coordinator...
===== DaCapo evaluation-git-2a2db73e kafka starting warmup 1 =====
Trogdor is running the workload....
Starting 1000000 requests...
100%
Completed requests
Finished
===== DaCapo evaluation-git-2a2db73e kafka completed warmup 1 in 15436 msec =====
===== DaCapo simple tail latency: 50% 23190 usec, 90% 347557 usec, 99% 413725 usec, 99.9% 427890 usec, 99.99% 429745 usec, max 431983 usec, measured over 1000000 events =====
===== DaCapo metered tail latency: 50% 185403 usec, 90% 458255 usec, 99% 480431 usec, 99.9% 484276 usec, 99.99% 485239 usec, max 485710 usec, measured over 1000000 events =====
===== DaCapo evaluation-git-2a2db73e kafka starting =====
Trogdor is running the workload....
Starting 1000000 requests...
100%
Completed requests
Finished
===== DaCapo evaluation-git-2a2db73e kafka PASSED in 5011 msec =====
===== DaCapo simple tail latency: 50% 10905 usec, 90% 20152 usec, 99% 31707 usec, 99.9% 36649 usec, 99.99% 37253 usec, max 37502 usec, measured over 1000000 events =====
===== DaCapo metered tail latency: 50% 14715 usec, 90% 38993 usec, 99% 68202 usec, 99.9% 77444 usec, 99.99% 79865 usec, max 80119 usec, measured over 1000000 events =====
...Agent has completed.
...Coordinator has completed.

The problem seems to be the timeout in ClientRunner.java: the fixed Thread.sleep(1000) that follows the topic deletion. We'll need to find a robust solution to handling the race to delete the topics.

Presumably the sensitivity to hardware reflects a timing difference. The timeout is 1000 ms; the problem manifests on slower machines, where this is presumably inadequate.

I have confirmed that the problem goes away on the same machine which reliably failed above by simply increasing the timeout:

$ git diff
diff --git a/benchmarks/bms/kafka/src/org/dacapo/kafka/ClientRunner.java b/benchmarks/bms/kafka/src/org/dacapo/kafka/ClientRunner.java
index b1c6b21c..e3fdaf76 100644
--- a/benchmarks/bms/kafka/src/org/dacapo/kafka/ClientRunner.java
+++ b/benchmarks/bms/kafka/src/org/dacapo/kafka/ClientRunner.java
@@ -44,7 +44,7 @@ public class ClientRunner{
         agentStarter.invoke(null, (Object) new String[]{"-c", agentConfig, "-n", "node0", "--exec", produceBench});
         topicCommand.invoke(null, (Object) new String[]{"--bootstrap-server", "localhost:9092", "--delete", "--topic", "dacapo-1,dacapo-2,dacapo-3,dacapo-4"});
         // Sleep one second waiting for the Kafka broker to delete the topics
-        Thread.sleep(1000);
+        Thread.sleep(2000);
         System.err.println("Finished");
     }

$ /usr/lib/jvm/temurin-11-jdk-amd64/bin/java -jar dacapo-evaluation-git-5c4d976c.jar -n 2 kafka
[...]
Using scaled threading model. 8 processors detected, 8 threads used to drive the workload, in a possible range of [1,unlimited]
Version: kafka 3.3.1
Nominal stats: AOA: 4|54, AOL: 4|48, AOM: 9|32, AOS: 3|16, ARA: 2|604, BAL: 2|1, BAS: 4|0, BEF: 8|3, BGF: 2|94, BPF: 2|25, BUB: 8|128, BUF: 8|21, GCA: 5|83, GCC: 10|153, GCM: 5|83, GCP: 1|0, GLK: 3|0, GMH: 9|203, GMU: 8|214, GSS: 1|0, GTO: 3|14, PET: 9|6, PPE: 2|3, PSD: 7|0, PWU: 8|1
Starting Zookeeper...
Starting Kafka Server...
===== DaCapo evaluation-git-5c4d976c kafka starting warmup 1 =====
Trogdor is running the workload....
Starting 1000000 requests...
100%
Completed requests
Finished
===== DaCapo evaluation-git-5c4d976c kafka completed warmup 1 in 14137 msec =====
===== DaCapo simple tail latency: 50% 24527 usec, 90% 352214 usec, 99% 456867 usec, 99.9% 475153 usec, 99.99% 477864 usec, max 480162 usec, measured over 1000000 events =====
===== DaCapo metered tail latency: 50% 237372 usec, 90% 488577 usec, 99% 513809 usec, 99.9% 517961 usec, 99.99% 519397 usec, max 519972 usec, measured over 1000000 events =====
===== DaCapo evaluation-git-5c4d976c kafka starting =====
Trogdor is running the workload....
Starting 1000000 requests...
100%
Completed requests
Finished
===== DaCapo evaluation-git-5c4d976c kafka PASSED in 12448 msec =====
===== DaCapo simple tail latency: 50% 10327 usec, 90% 28660 usec, 99% 48697 usec, 99.9% 54340 usec, 99.99% 55158 usec, max 55416 usec, measured over 1000000 events =====
===== DaCapo metered tail latency: 50% 16685 usec, 90% 46610 usec, 99% 69551 usec, 99.9% 78254 usec, 99.99% 80745 usec, max 80999 usec, measured over 1000000 events =====

Conversely, we can introduce the problem on the faster machines by making the timeout inadequate:

$ git diff
diff --git a/benchmarks/bms/kafka/src/org/dacapo/kafka/ClientRunner.java b/benchmarks/bms/kafka/src/org/dacapo/kafka/ClientRunner.java
index b1c6b21c..0d0ae855 100644
--- a/benchmarks/bms/kafka/src/org/dacapo/kafka/ClientRunner.java
+++ b/benchmarks/bms/kafka/src/org/dacapo/kafka/ClientRunner.java
@@ -44,7 +44,7 @@ public class ClientRunner{
         agentStarter.invoke(null, (Object) new String[]{"-c", agentConfig, "-n", "node0", "--exec", produceBench});
         topicCommand.invoke(null, (Object) new String[]{"--bootstrap-server", "localhost:9092", "--delete", "--topic", "dacapo-1,dacapo-2,dacapo-3,dacapo-4"});
         // Sleep one second waiting for the Kafka broker to delete the topics
-        Thread.sleep(1000);
+        Thread.sleep(1);
         System.err.println("Finished");
     }

$ /usr/lib/jvm/temurin-11-jdk-amd64/bin/java -jar dacapo-23.9-RC2-git-e3104f0e.jar kafka -n 2
[...]
Using scaled threading model. 24 processors detected, 24 threads used to drive the workload, in a possible range of [1,unlimited]
Version: kafka 3.3.1 (use -s for nominal benchmark stats)
Starting Zookeeper...
Starting Kafka Server...
===== DaCapo 23.9-RC2-git-e3104f0e kafka starting warmup 1 =====
Trogdor is running the workload....
Starting 1000000 requests...
100%
Completed requests
Finished
===== DaCapo 23.9-RC2-git-e3104f0e kafka completed warmup 1 in 6456 msec =====
===== DaCapo simple tail latency: 50% 7233 usec, 90% 18296 usec, 99% 149593 usec, 99.9% 151711 usec, 99.99% 152167 usec, max 152450 usec, measured over 1000000 events =====
===== DaCapo metered tail latency: 50% 24134 usec, 90% 64558 usec, 99% 149593 usec, 99.9% 151711 usec, 99.99% 152167 usec, max 152450 usec, measured over 1000000 events =====
===== DaCapo 23.9-RC2-git-e3104f0e kafka starting =====
Trogdor is running the workload....
Starting 1000000 requests...
Completed requests
Finished
Digest validation failed for stdout.log, expecting 0xd054f76402d83d91bc9824746d89c489c458a103 found 0xda227bc97a9995dc07e939807ed6902ffbe12cb8
===== DaCapo 23.9-RC2-git-e3104f0e kafka FAILED =====
===== DaCapo simple tail latency: 50% 0 usec, 90% 0 usec, 99% 0 usec, 99.9% 0 usec, 99.99% 0 usec, max 0 usec, measured over 1000000 events =====
===== DaCapo metered tail latency: 50% 0 usec, 90% 0 usec, 99% 0 usec, 99.9% 0 usec, 99.99% 0 usec, max 0 usec, measured over 1000000 events =====
Validation FAILED for kafka default

I have committed 6ea164a, a workaround which does two things:

  • allows the user to use -f to specify an integer dilation factor, to increase the timeout.
  • moves the sleep outside the timing loop of the benchmark (independent issue discovered while addressing this)
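
For anyone reading along, here is a minimal sketch of the two ideas in that workaround. It is not the code in 6ea164a: the class DilatedSettle, the methods settleMillis and runIteration, and the way the factor is passed in are all hypothetical names chosen for illustration. It only shows a base delay being multiplied by an integer dilation factor, and the sleep happening after the timed region has ended.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the DaCapo harness code.
class DilatedSettle {

    // Scale a base settle delay (ms) by an integer dilation factor (>= 1).
    static long settleMillis(long baseMs, int factor) {
        if (baseMs < 0 || factor < 1) {
            throw new IllegalArgumentException("baseMs must be >= 0 and factor >= 1");
        }
        return Math.multiplyExact(baseMs, (long) factor);
    }

    // Run the workload inside the timed region, then clean up and sleep
    // outside it. Returns the duration of the timed region in milliseconds.
    static long runIteration(Runnable workload, Runnable cleanup,
                             long baseMs, int factor) throws InterruptedException {
        long start = System.nanoTime();
        workload.run();
        long elapsedNanos = System.nanoTime() - start; // timing stops here

        cleanup.run();                                 // e.g. request topic deletion
        Thread.sleep(settleMillis(baseMs, factor));    // not counted in the result
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
    }

    public static void main(String[] args) throws InterruptedException {
        long ms = runIteration(() -> {}, () -> {}, 50, 2);
        System.out.println("timed region: " + ms + " ms, settle delay of "
                + settleMillis(50, 2) + " ms excluded");
    }
}
```

With a base of 1000 ms, a factor of 2 gives the 2000 ms delay that made the failing machine pass in the experiment above.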

I attempted to replace the sleep() with a polling call to the kafka admin to list the topics, only proceeding once the relevant topics were shown as deleted. To do this, I needed to expose getTopics(), so that I could check the list of topics.

Unfortunately, it seems that the topics are removed from the topic list immediately but deleted asynchronously, so simply checking the topic list is not sufficient; sleep() remains necessary.
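
To make the shape of that concrete, a rough sketch of "poll, then still sleep" could look like the following. This is an assumption-laden illustration rather than what the benchmark does: PollThenSettle and awaitThenSettle are hypothetical names, and the BooleanSupplier is a stand-in for whatever check reports that the topics no longer appear in the topic list (no Kafka API is called here). Because the list is observed to update before the asynchronous deletion finishes, the sketch keeps a settle delay after the poll succeeds.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; the supplier stands in for a topic-list check.
class PollThenSettle {

    // Poll until topicsGone reports true or timeoutMs elapses. On success,
    // sleep settleMs more, since disappearing from the list does not imply
    // the deletion has completed. Returns false if the poll timed out.
    static boolean awaitThenSettle(BooleanSupplier topicsGone, long timeoutMs,
                                   long pollMs, long settleMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!topicsGone.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;              // gave up waiting for the list to clear
            }
            Thread.sleep(pollMs);
        }
        Thread.sleep(settleMs);            // the sleep that remains necessary
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Stand-in check that succeeds on its third call.
        boolean ok = awaitThenSettle(() -> ++calls[0] >= 3, 1000, 10, 50);
        System.out.println("settled=" + ok + " after " + calls[0] + " checks");
    }
}
```

The poll only bounds the wait for the list to clear; it does not remove the need for a delay that is long enough on slow machines.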