zeek / broker

Zeek's Messaging Library

Home Page: https://docs.zeek.org/projects/broker

High caf thread utilization with 512 workers and 96 loggers

awelzel opened this issue · comments

On the zeekorg Slack, users reported seeing high logger CPU usage in a cluster of 512 Zeek workers and 96 Zeek loggers (distributed over multiple physical systems).

A perf top and flamegraph of a single logging process indicate that most time is spent in the caf layer, roughly 72% of it in on_consumed_data(). The stacks might not be fully valid.

[flamegraph: current_perf]

In contrast, Zeek's logging threads, the librdkafka threads and lz4 compression account for only ~10% of the profile, on the right-hand side of the flamegraph. This looks like a pathological case being triggered.

The following supervisor setup, which starts 80 workers and 12 loggers, lowers Broker::log_batch_interval to 0.25 sec and does artificial logging to a single stream, does provoke some overhead in the caf thread, again in on_consumed_data(). I didn't have easy access to a system where I could run a 512-worker experiment, but maybe this gives a starting point.

[flamegraph: broker-caf-on-consumed-data]

# Run with zeek -j this-script.zeek

@load base/frameworks/cluster
@load base/frameworks/reporter

redef LogAscii::use_json = T;
redef Broker::disable_ssl = T;


redef Reporter::info_to_stderr = T;
redef Reporter::warnings_to_stderr = T;
redef Reporter::errors_to_stderr = T;

event do_create_node(sn: Supervisor::NodeConfig)
	{
	local res = Supervisor::create(sn);
	print "starting",  sn$name;

	if ( res != "" )
		print fmt("supervisor failed to create node '%s': %s", sn, res);
	}

event zeek_init()
	{
	# print Cluster::local_node_type();
	if ( ! Supervisor::is_supervisor() )
		return;

	Broker::listen("127.0.0.1", 9999/tcp);

	local cluster: table[string] of Supervisor::ClusterEndpoint;
	cluster["manager"] = [$role=Supervisor::MANAGER, $host=127.0.0.1, $p=10000/tcp];
	cluster["proxy"] = [$role=Supervisor::PROXY, $host=127.0.0.1, $p=10001/tcp];

	local loggers = 12;
	local i = 0;
	while ( i < loggers )
		{
		++i;
		local lp = count_to_port(10010 + i, tcp);
		local logger_name = fmt("l-%d", i);
		cluster[logger_name] = [$role=Supervisor::LOGGER, $host=127.0.0.1, $p=lp];
		}

	local workers = 80;
	local worker_port_offset = 10100;
	i = 0;
	while ( i < workers )
		{
		++i;
		local name = fmt("w-%d", i);
		# local p = count_to_port(worker_port_offset + i, tcp);
		cluster[name] = [$role=Supervisor::WORKER, $host=127.0.0.1, $p=0/tcp, $interface="lo"];
		}

	i = 0;
	for ( n, ep in cluster )
		{
		++i;
		local sn = Supervisor::NodeConfig($name=n);
		sn$cluster = cluster;
		sn$directory = n;
		sn$env = table(["ZEEK_DEFAULT_CONNECT_RETRY"] = "16");

		if ( ep?$interface )
			sn$interface = ep$interface;

		schedule double_to_interval(i * 0.1) { do_create_node(sn) };
		}
	}

@if ( ! Supervisor::is_supervisor() )

redef Broker::log_batch_interval = 0.25sec;

module TestLog;

redef enum Log::ID += { LOG };

type Info: record {
	ts: time &log &default=current_time();
	node: string &log &default=Cluster::node;
	code: count &log;
	status: string &log;
	rand: count &log &default=rand(10000);
};

event log_write() {
	Log::write(LOG, [$code=200, $status="Hello"]);
	schedule 0.01sec { log_write() };
}

event zeek_init()
	{
	Log::create_stream(LOG, [$path="my-http", $columns=Info]);
	if ( Cluster::local_node_type() == Cluster::WORKER )
		schedule 0.01sec { log_write() };
	}
@endif

On Slack, the user provided a new flamegraph that looks promising: ~55% of the time is spent in std::max_element() within mcast::max_buffered().

https://github.com/zeek/actor-framework/blob/b50e2249761c2db5d28f94b64a6df92d0c3d0cc2/libcaf_core/caf/flow/op/mcast.hpp#L137-L147

It can probably be assumed that _states grew large and that the linear search for any of the min/max of buffered/demand ends up taking most of the CPU time.

[flamegraph: perf_20230513102803]
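
To make the suspicion concrete, here is a minimal, standalone C++ sketch of that pattern (this is not the actual CAF implementation; observer_state and mcast_like are made-up names for illustration): the multicast operator keeps one state per observer, and helpers such as max_buffered() do a full linear scan over all states each time they are called, so the cost of hot paths like on_consumed_data() grows with the number of connected peers.

// Simplified illustration of the suspected pattern, not the actual CAF code.
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

struct observer_state {
  std::deque<int> buf;    // items buffered for this observer
  std::size_t demand = 0; // outstanding demand signaled by this observer
};

struct mcast_like {
  // One entry per observer; grows with the number of connected peers.
  std::vector<observer_state> states;

  // O(N) in the number of observers; runs every time the operator needs the
  // largest buffer, e.g. while processing consumed data.
  std::size_t max_buffered() const {
    auto i = std::max_element(states.begin(), states.end(),
                              [](const observer_state& x, const observer_state& y) {
                                return x.buf.size() < y.buf.size();
                              });
    return i != states.end() ? i->buf.size() : 0;
  }

  // Same linear scan for the smallest outstanding demand.
  std::size_t min_demand() const {
    auto i = std::min_element(states.begin(), states.end(),
                              [](const observer_state& x, const observer_state& y) {
                                return x.demand < y.demand;
                              });
    return i != states.end() ? i->demand : 0;
  }
};

With hundreds of workers feeding a logger and one scan per consumed batch, that linear factor lines up with the max_element() hot spot in the flamegraph above.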

Nice, thanks for doing all that digging. 👍

I can somewhat reproduce this on a bigger machine with the original reproducer, but I also trigger a high percentage of native_queued_spin_lock_slowpath on that system and don't have great symbols for libc. If a fix that improves the op::mcast parts becomes available, it would still be reasonable to test.

[flamegraph: 8-loggers-196-workers]

I can somewhat reproduce this on a bigger machine with the original reproducer

Thanks, Arne. That's good to know! I'll ping you when I have a patch ready to try.

@awelzel can you try topic/neverlord/gh-352? The branch contains the Broker-side fix for #366 plus a CAF-side patch to address frequent max_element calls.

I've written a small benchmark that pushes 1k messages (integers) to up to 1024 observers. With the patch, the runtime drops significantly:

[screenshot: benchmark results, 2023-06-18]

Hopefully the synthetic benchmark also translates to real results in Broker.
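
The benchmark itself isn't posted in the thread; as a rough, CAF-free illustration of its shape (1k messages pushed to up to 1024 observers) and of why the unpatched path scales poorly, the sketch below recomputes the maximum buffer size on every push. All names and numbers here are hypothetical; it only demonstrates that a per-push linear scan makes the total cost grow with the observer count.

// Standalone illustration only, not the CAF flow benchmark from the comment above.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <vector>

int main() {
  constexpr int num_messages = 1000; // "1k messages (integers)"
  for (std::size_t num_observers : {16, 64, 256, 1024}) {
    std::vector<std::deque<int>> buffers(num_observers);
    auto start = std::chrono::steady_clock::now();
    for (int msg = 0; msg < num_messages; ++msg) {
      for (auto& buf : buffers)
        buf.push_back(msg); // multicast the message to every observer
      // Per-push linear scan over all observer buffers (the suspected hot spot).
      auto it = std::max_element(buffers.begin(), buffers.end(),
                                 [](const auto& x, const auto& y) {
                                   return x.size() < y.size();
                                 });
      volatile std::size_t max_buffered = it->size(); // keep the result alive
      (void)max_buffered;
    }
    auto elapsed = std::chrono::duration<double, std::milli>(
                       std::chrono::steady_clock::now() - start).count();
    std::printf("%4zu observers: %.2f ms\n", num_observers, elapsed);
  }
}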

I've re-tested master with 196 workers, 16 loggers and a log_batch_interval of 0.10 sec.

The worker script was tweaked to write nlogger messages to two log streams every log_batch_interval.

This results in ~65% CPU usage per logger process with the following breakdown / flamegraph; most overhead is in the finalize() -> on_consumed_some() path.
[flamegraph: master]

The same reproducer with branch topic/neverlord/gh-352 reduces the CPU usage of the logger process to ~35%. Flamegraph as follows; the overhead in on_consumed_some() is no longer apparent. Nice job!

[flamegraph: 196-16-gh-352]

An external user has reported that logger CPU usage decreased significantly:

Single logger (16c64g) CPU usage decreased from 76% to 24%.

However, they have observed two occurrences where worker memory increased rapidly from 16 GB to 64 GB after running seemingly stable for ~20 hours. I have requested more telemetry/metrics for the involved logger processes.

A flamegraph of one of their loggers after deploying the patch does not show max_element().
[flamegraph: perf]

From Slack, the suspicion is that the loggers stop accepting log messages, causing the workers to run out of memory. The user reports that this is not happening with their vanilla 5.2.0 deployment.

Thanks for the details. The CPU usage of the loggers plummets around the time the worker memory increases rapidly, so one suspicion is that the loggers stop processing log messages and the workers queue them up on their end. We will need to see if this can be reproduced in a test setup.

This triggered in the user's environment only after ~20 hours, with a pretty large cluster under production load, so I'm not sure what the chances are of reproducing it in a test environment, but we could give it a shot.

(https://zeekorg.slack.com/archives/CSZBXF6TH/p1687920539949679?thread_ts=1683797785.775629&cid=CSZBXF6TH)

Thank you for taking care of playing this back to the reporter. 🙂

Let me try to reproduce this bug with a stress test. I'll ping you if I have a new version to test.