[BUG] - KoP CPU usage is 10 times higher than Pulsar on the broker side
Hongten opened this issue
Describe the bug
KoP CPU usage is 10 times higher than Pulsar's on the broker side.
To Reproduce
Steps to reproduce the behavior:
- Create a topic with 100 partitions.
- Use a Pulsar client to produce data to the topic, and record the broker CPU usage.
- Use a Kafka client to produce data to the topic, and record the broker CPU usage.
Expected behavior
KoP CPU usage should be close to that of Pulsar itself: equal to it or only slightly higher.
In this case, the broker CPU usage is around 3 when I use the Pulsar client and around 30 when I use the Kafka client.
Expected result: with the Kafka client, the broker CPU usage is at least 3, but not much higher.
Screenshots
The broker (KoP) flame graph
Pulsar client vs. Kafka client (CPU usage)
Additional context
Please refer to the flame graph HTML file.
broker_flame_graph.html.zip
What's your KoP version? It looks like sanitizeMetricName, which converts a metric name to a valid name, took too much time.
However, the io.prometheus:simpleclient:0.16.0 dependency of KoP master does not call any regex-related APIs:
    public static String sanitizeMetricName(String metricName) {
        int length = metricName.length();
        char[] sanitized = new char[length];
        for (int i = 0; i < length; ++i) {
            char ch = metricName.charAt(i);
            if (ch != ':' && (ch < 'a' || ch > 'z') && (ch < 'A' || ch > 'Z') && (i <= 0 || ch < '0' || ch > '9')) {
                sanitized[i] = '_';
            } else {
                sanitized[i] = ch;
            }
        }
        return new String(sanitized);
    }
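To illustrate why a regex-based sanitizer can show up in a flame graph while the loop above does not, here is a minimal sketch that compares the two styles. The class name, the two patterns, and the sample metric names are assumptions made for illustration; the regex variant is not copied from any Prometheus client release, it only mimics the kind of implementation the thread calls inefficient. Timings depend on the JVM and hardware, so none are stated here.

```java
import java.util.regex.Pattern;

public class SanitizeComparison {
    // Assumed regex-based variant (illustrative only): fix the first character,
    // then replace every remaining invalid character.
    private static final Pattern PREFIX = Pattern.compile("^[^a-zA-Z_:]");
    private static final Pattern BODY = Pattern.compile("[^a-zA-Z0-9_:]");

    static String sanitizeWithRegex(String name) {
        return BODY.matcher(PREFIX.matcher(name).replaceFirst("_")).replaceAll("_");
    }

    // Single-pass loop with the same rules: letters, '_' and ':' are always kept,
    // digits are kept everywhere except at position 0.
    static String sanitizeWithLoop(String name) {
        int length = name.length();
        char[] out = new char[length];
        for (int i = 0; i < length; i++) {
            char ch = name.charAt(i);
            boolean letter = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
            boolean digit = i > 0 && ch >= '0' && ch <= '9';
            out[i] = (letter || digit || ch == '_' || ch == ':') ? ch : '_';
        }
        return new String(out);
    }

    public static void main(String[] args) {
        String name = "kop.server.request-queue.size"; // hypothetical metric name
        int rounds = 200_000;
        long sink = 0; // consume results so the JIT cannot drop the calls

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += sanitizeWithRegex(name).length();
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += sanitizeWithLoop(name).length();
        }
        long t2 = System.nanoTime();

        System.out.println("regex: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("loop:  " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("(checksum " + sink + ")");
    }
}
```

This is a rough timing loop rather than a proper benchmark; a harness such as JMH would give more trustworthy numbers.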
KoP: 2.10.1.11
Pulsar: 2.10.1
It seems to be a bug in Pulsar itself. I checked the Prometheus dependency of Pulsar 2.10.2: it is still 0.5.0, which has an inefficient implementation of sanitizeMetricName.
apache-pulsar-2.10.2$ ls lib | grep prometheus
io.prometheus-simpleclient-0.5.0.jar
The Prometheus dependency was upgraded in apache/pulsar#13785, which cannot be cherry-picked to branch-2.10. I think we need to implement the method manually so that we do not depend on the upgrade.
Hmm, I double-checked the Pulsar (broker) lib folder and found that io.prometheus-simpleclient is 0.5.0.
~ ll | grep prometheus
-rwxrwxrwx 1 root root 59175 Jan 22 2020 io.prometheus-simpleclient-0.5.0.jar*
-rwxrwxrwx 1 root root 5575 Jan 22 2020 io.prometheus-simpleclient_caffeine-0.5.0.jar*
-rwxrwxrwx 1 root root 5838 Jan 22 2020 io.prometheus-simpleclient_common-0.5.0.jar*
-rwxrwxrwx 1 root root 18245 Jan 22 2020 io.prometheus-simpleclient_hotspot-0.5.0.jar*
-rwxrwxrwx 1 root root 9517 Jan 22 2020 io.prometheus-simpleclient_httpserver-0.5.0.jar*
-rwxrwxrwx 1 root root 5177 Jan 22 2020 io.prometheus-simpleclient_jetty-0.5.0.jar*
-rwxrwxrwx 1 root root 4583 Jan 22 2020 io.prometheus-simpleclient_log4j2-0.5.0.jar*
-rwxrwxrwx 1 root root 7103 Jan 22 2020 io.prometheus-simpleclient_servlet-0.5.0.jar*
-rwxrwxrwx 1 root root 29885 Jan 22 2020 io.prometheus.jmx-collector-0.14.0.jar*
-rwxrwxrwx 1 root root 30649 Jan 22 2020 org.apache.bookkeeper.stats-prometheus-metrics-provider-4.14.5.jar*
-rwxrwxrwx 1 root root 16427 Jan 22 2020 org.apache.zookeeper-zookeeper-prometheus-metrics-3.6.3.jar*
After I upgraded the Prometheus client from 0.5.0 to 0.15.0 (by cherry-picking the code from apache/pulsar#13785), the CPU usage still exceeds 30, and the Prometheus-related code costs about 24 percentage points more CPU than before (with 0.5.0).
P1 - The CPU usage with Prometheus version 0.5.0
P2 - The CPU usage with Prometheus version 0.15.0.
P3 - Prometheus version 0.5.0. The Prometheus-related code costs 37.93% CPU.
P4 - Prometheus version 0.15.0. The Prometheus-related code costs 62.13% CPU.
The broker Prometheus libs
ll | grep prometheus
-rwxrwxrwx 1 root root 89127 Jan 22 2020 io.prometheus-simpleclient-0.15.0.jar*
-rwxrwxrwx 1 root root 5556 Jan 22 2020 io.prometheus-simpleclient_caffeine-0.15.0.jar*
-rwxrwxrwx 1 root root 8008 Jan 22 2020 io.prometheus-simpleclient_common-0.15.0.jar*
-rwxrwxrwx 1 root root 24094 Jan 22 2020 io.prometheus-simpleclient_hotspot-0.15.0.jar*
-rwxrwxrwx 1 root root 14456 Jan 22 2020 io.prometheus-simpleclient_httpserver-0.15.0.jar*
-rwxrwxrwx 1 root root 5200 Jan 22 2020 io.prometheus-simpleclient_jetty-0.15.0.jar*
-rwxrwxrwx 1 root root 4595 Jan 22 2020 io.prometheus-simpleclient_log4j2-0.15.0.jar*
-rwxrwxrwx 1 root root 88445 Jan 22 2020 io.prometheus-simpleclient_servlet-0.15.0.jar*
-rwxrwxrwx 1 root root 12941 Jan 22 2020 io.prometheus-simpleclient_servlet_common-0.15.0.jar*
-rwxrwxrwx 1 root root 3378 Jan 22 2020 io.prometheus-simpleclient_tracer_common-0.15.0.jar*
-rwxrwxrwx 1 root root 4272 Jan 22 2020 io.prometheus-simpleclient_tracer_otel-0.15.0.jar*
-rwxrwxrwx 1 root root 4537 Jan 22 2020 io.prometheus-simpleclient_tracer_otel_agent-0.15.0.jar*
-rwxrwxrwx 1 root root 31808 Jan 22 2020 io.prometheus.jmx-collector-0.16.1.jar*
-rwxrwxrwx 1 root root 30649 Jan 22 2020 org.apache.bookkeeper.stats-prometheus-metrics-provider-4.14.5.jar*
-rwxrwxrwx 1 root root 16427 Jan 22 2020 org.apache.zookeeper-zookeeper-prometheus-metrics-3.6.3.jar*
The sanitizeMetricName implementation that KoP uses is fixed when KoP is built. It is not loaded dynamically from the broker's libraries, so upgrading the broker alone does not help. You should upgrade KoP.
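One way to check which jar actually supplies a class at runtime is to ask the class for its code source. The sketch below is a generic helper under stated assumptions: the class name ClassOrigin and the method originOf are hypothetical, and the idea of running it inside the broker against io.prometheus.client.Collector is only a suggestion, not something the thread did.

```java
import java.security.CodeSource;

public class ClassOrigin {
    // Returns the location (jar or directory) a class was loaded from,
    // or a placeholder when the JVM does not expose one (e.g. core JDK classes).
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Core JDK classes have no code source.
        System.out.println("String:      " + originOf(String.class));
        // Application classes usually report the jar or directory they came from.
        System.out.println("ClassOrigin: " + originOf(ClassOrigin.class));
        // Assumption: inside the broker one could pass the Prometheus Collector class
        // here instead, to see which simpleclient jar is really in effect.
    }
}
```

If the reported location still points at an older simpleclient jar after an upgrade, the upgraded library is not the one being used by that code path.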
Summary:
Upgrading the Prometheus client to 0.16.0 in both Pulsar and KoP improves performance by around 37% (the share of CPU previously spent in Prometheus-related code).
After upgrading to 0.16.0 in both Pulsar and KoP, the Prometheus-related code accounts for 0.58% of CPU according to the flame graph. The broker's CPU usage dropped by around 16.3 percentage points (before: 33%, now: 16.7%).
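Because the figures in this thread are CPU shares, a drop can be read either as percentage points or as a relative reduction. The small sketch below works through both readings for the numbers reported in this thread; the class and method names are hypothetical.

```java
public class CpuDelta {
    // Absolute drop, in percentage points.
    static double pointDrop(double before, double after) {
        return before - after;
    }

    // Relative drop, as a percentage of the starting value.
    static double relativeDrop(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Broker CPU usage reported in the thread: 33% before, 16.7% after.
        System.out.println(String.format("broker: %.2f points, %.2f%% relative",
                pointDrop(33.0, 16.7), relativeDrop(33.0, 16.7)));
        // Prometheus-related share from the flame graphs: 37.93% before, 0.58% after.
        System.out.println(String.format("prometheus: %.2f points, %.2f%% relative",
                pointDrop(37.93, 0.58), relativeDrop(37.93, 0.58)));
    }
}
```

Stating which of the two readings is meant avoids ambiguity when comparing results across runs.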
Summary:
The CPU usage of KoP-related code can be reduced by around 10.75 percentage points by setting entryFormat=kafka (the default is entryFormat=pulsar).
P1 - With entryFormat=pulsar, the KoP-related code uses around 50.37% of CPU. Almost all of it is spent in two methods: PulsarEntryFormatter.encode() and ByteBufUtils.decodePulsarEntryToKafkaRecords().
P2 - With entryFormat=kafka, the KoP-related code uses around 39.62% of CPU.
P3 - The broker CPU usage shows no apparent change. The expected result is that CPU usage with entryFormat=kafka is lower than with entryFormat=pulsar. The lack of change may be caused by a different number of partitions on the broker after the restart.