zeek / broker

Zeek's Messaging Library

Home Page: https://docs.zeek.org/projects/broker

High caf thread utilization with 512 workers and 96 loggers

awelzel opened this issue · comments

On the zeekorg Slack, users reported seeing high logger CPU usage in a cluster of 512 Zeek workers and 96 Zeek loggers (distributed over multiple physical systems).

A perf top and flamegraph of a single logging process indicate that most time is spent in the caf layer, roughly 72% of it in on_consumed_data(). The stacks might not be fully valid.

[flamegraph: current_perf]

In contrast, Zeek's logging threads, the librdkafka threads and lz4 compression account for only ~10% of the profile, on the right-hand side of the flamegraph. This looks like a pathological case being triggered.

The following supervisor setup, which starts 80 workers and 12 loggers, lowers Broker::log_batch_interval to 0.25 sec and does artificial logging to a single stream, does provoke some overhead in the caf thread, again in on_consumed_data(). I didn't have easy access to a system where I could run a 512-worker experiment, but maybe this gives a starting point.

[flamegraph: broker-caf-on-consumed-data]

# Run with zeek -j this-script.zeek

@load base/frameworks/cluster
@load base/frameworks/reporter

redef LogAscii::use_json = T;
redef Broker::disable_ssl = T;


redef Reporter::info_to_stderr = T;
redef Reporter::warnings_to_stderr = T;
redef Reporter::errors_to_stderr = T;

event do_create_node(sn: Supervisor::NodeConfig)
	{
	local res = Supervisor::create(sn);
	print "starting",  sn$name;

	if ( res != "" )
		print fmt("supervisor failed to create node '%s': %s", sn, res);
	}

event zeek_init()
	{
	# print Cluster::local_node_type();
	if ( ! Supervisor::is_supervisor() )
		return;

	Broker::listen("127.0.0.1", 9999/tcp);

	local cluster: table[string] of Supervisor::ClusterEndpoint;
	cluster["manager"] = [$role=Supervisor::MANAGER, $host=127.0.0.1, $p=10000/tcp];
	cluster["proxy"] = [$role=Supervisor::PROXY, $host=127.0.0.1, $p=10001/tcp];

	local loggers = 12;
	local i = 0;
	while ( i < loggers )
		{
		++i;
		local lp = count_to_port(10010 + i, tcp);
		local logger_name = fmt("l-%d", i);
		cluster[logger_name] = [$role=Supervisor::LOGGER, $host=127.0.0.1, $p=lp];
		}

	local workers = 80;
	local worker_port_offset = 10100;
	i = 0;
	while ( i < workers )
		{
		++i;
		local name = fmt("w-%d", i);
		# local p = count_to_port(worker_port_offset + i, tcp);
		cluster[name] = [$role=Supervisor::WORKER, $host=127.0.0.1, $p=0/tcp, $interface="lo"];
		}

	i = 0;
	for ( n, ep in cluster )
		{
		++i;
		local sn = Supervisor::NodeConfig($name=n);
		sn$cluster = cluster;
		sn$directory = n;
		sn$env = table(["ZEEK_DEFAULT_CONNECT_RETRY"] = "16");

		if ( ep?$interface )
			sn$interface = ep$interface;

		schedule double_to_interval(i * 0.1) { do_create_node(sn) };
		}
	}

@if ( ! Supervisor::is_supervisor() )

redef Broker::log_batch_interval = 0.25sec;

module TestLog;

redef enum Log::ID += { LOG };

type Info: record {
	ts: time &log &default=current_time();
	node: string &log &default=Cluster::node;
	code: count &log;
	status: string &log;
	rand: count &log &default=rand(10000);
};

event log_write() {
	Log::write(LOG, [$code=200, $status="Hello"]);
	schedule 0.01sec { log_write() };
}

event zeek_init()
	{
	Log::create_stream(LOG, [$path="my-http", $columns=Info]);
	if ( Cluster::local_node_type() == Cluster::WORKER )
		schedule 0.01sec { log_write() };
	}
@endif

On Slack, the user provided a new flamegraph that looks promising: ~55% of the time is spent in std::max_element() within mcast::max_buffered().

https://github.com/zeek/actor-framework/blob/b50e2249761c2db5d28f94b64a6df92d0c3d0cc2/libcaf_core/caf/flow/op/mcast.hpp#L137-L147

It can probably be assumed that _states grew large and that the linear search for any of the min/max of buffered/demand ends up taking most of the CPU time.

[flamegraph: perf_20230513102803]
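
To make the suspicion concrete, here is a minimal, standalone C++ sketch of that pattern (this is not the actual CAF implementation; observer_state and mcast_like are made-up names for illustration): the multicast operator keeps one state per observer, and helpers such as max_buffered() do a full linear scan over all states each time they are called, so the cost of hot paths like on_consumed_data() grows with the number of connected peers.

// Simplified illustration of the suspected pattern, not the actual CAF code.
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

struct observer_state {
  std::deque<int> buf;    // items buffered for this observer
  std::size_t demand = 0; // outstanding demand signaled by this observer
};

struct mcast_like {
  // One entry per observer; grows with the number of connected peers.
  std::vector<observer_state> states;

  // O(N) in the number of observers; runs every time the operator needs the
  // largest buffer, e.g. while processing consumed data.
  std::size_t max_buffered() const {
    auto i = std::max_element(states.begin(), states.end(),
                              [](const observer_state& x, const observer_state& y) {
                                return x.buf.size() < y.buf.size();
                              });
    return i != states.end() ? i->buf.size() : 0;
  }

  // Same linear scan for the smallest outstanding demand.
  std::size_t min_demand() const {
    auto i = std::min_element(states.begin(), states.end(),
                              [](const observer_state& x, const observer_state& y) {
                                return x.demand < y.demand;
                              });
    return i != states.end() ? i->demand : 0;
  }
};

With hundreds of workers feeding a logger and one scan per consumed batch, that linear factor lines up with the max_element() hot spot in the flamegraph above.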

Nice, thanks for doing all that digging. 👍

I can somewhat reproduce this on a bigger machine with the original reproducer, but I also trigger a high percentage of native_queued_spin_lock_slowpath on that system and don't have great symbols for libc. If a fix that improves the op::mcast parts becomes available, it would still be reasonable to test.

[flamegraph: 8-loggers-196-workers]

I can somewhat reproduce this on a bigger machine with the original reproducer

Thanks, Arne. That's good to know! I'll ping you when I have a patch ready to try.

@awelzel can you try topic/neverlord/gh-352? The branch contains the Broker-side fix for #366 plus a CAF-side patch to address frequent max_element calls.

I've written a small benchmark that pushes 1k messages (integers) to up to 1024 observers. With the patch, the runtime drops significantly:

[screenshot: benchmark results, 2023-06-18]

Hopefully the synthetic benchmark also translates to real results in Broker.
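
The benchmark itself isn't posted in the thread; as a rough, CAF-free illustration of its shape (1k messages pushed to up to 1024 observers) and of why the unpatched path scales poorly, the sketch below recomputes the maximum buffer size on every push. All names and numbers here are hypothetical; it only demonstrates that a per-push linear scan makes the total cost grow with the observer count.

// Standalone illustration only, not the CAF flow benchmark from the comment above.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <vector>

int main() {
  constexpr int num_messages = 1000; // "1k messages (integers)"
  for (std::size_t num_observers : {16, 64, 256, 1024}) {
    std::vector<std::deque<int>> buffers(num_observers);
    auto start = std::chrono::steady_clock::now();
    for (int msg = 0; msg < num_messages; ++msg) {
      for (auto& buf : buffers)
        buf.push_back(msg); // multicast the message to every observer
      // Per-push linear scan over all observer buffers (the suspected hot spot).
      auto it = std::max_element(buffers.begin(), buffers.end(),
                                 [](const auto& x, const auto& y) {
                                   return x.size() < y.size();
                                 });
      volatile std::size_t max_buffered = it->size(); // keep the result alive
      (void)max_buffered;
    }
    auto elapsed = std::chrono::duration<double, std::milli>(
                       std::chrono::steady_clock::now() - start).count();
    std::printf("%4zu observers: %.2f ms\n", num_observers, elapsed);
  }
}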

I've re-tested master with 196 workers, 16 loggers and a log_batch_interval of 0.10 sec.

The worker script was tweaked to write nlogger messages to two log streams every log_batch_interval.

This results in ~65% CPU usage per logger process with the following breakdown / flamegraph; most overhead is in the finalize() -> on_consumed_some() path.
[flamegraph: master]

The same reproducer with branch topic/neverlord/gh-352 reduces the CPU usage of the logger process to ~35%. Flamegraph as follows; the overhead in on_consumed_some() is no longer apparent. Nice job!

[flamegraph: 196-16-gh-352]

An external user has reported that logger CPU usage decreased significantly:

Single logger (16c64g) CPU usage decreased from 76% to 24%.

However, they have observed two occurrences where worker memory increased rapidly from 16 GB to 64 GB after running seemingly stable for ~20 hours. I have requested more telemetry/metrics for the involved logger processes.

A flamegraph of one of their loggers after deploying the patch does not show max_element().
[flamegraph: perf]

From Slack, the suspicion is that the loggers stop accepting log messages, causing the workers to run out of memory. The user reports that this is not happening with their vanilla 5.2.0 deployment.

Thanks for the details. The CPU usage of the loggers plummets around the time the worker memory increases rapidly, so one suspicion is that the loggers stop processing log messages and the workers queue them up on their end. We will need to see if this can be reproduced in a test setup.

This triggered in the user's environment only after ~20 hours, with a pretty large cluster under production load, so I'm not sure what the chances are of reproducing it in a test environment, but we could give it a shot.

(https://zeekorg.slack.com/archives/CSZBXF6TH/p1687920539949679?thread_ts=1683797785.775629&cid=CSZBXF6TH)

Thank you for taking care of playing this back to the reporter. 🙂

Let me try to reproduce this bug with a stress test. I'll ping you if I have a new version to test.