zeek / broker

Zeek's Messaging Library

Home Page: https://docs.zeek.org/projects/broker


vector::erase() performance bottleneck

awelzel opened this issue · comments

In zeek/zeek#2394, a performance bottleneck pointing at vector::erase() was observed. The following test reproduces it by starting a manager and 16 workers, then producing 1 million Intel items in quick succession to simulate reading an Intel file.

Starting the attached script with zeek -j, the following observations were made (edit: this was with Zeek version 5.1.0-dev.497):

  • Memory usage of the manager increases from ~100MB to 574MB, more or less depending on the number of workers. Dialing back to 4 workers, max memory usage is ~320MB.
  • After all intel items have been produced, the manager process runs at ~110% CPU (fully utilized caf.thread) for ~110sec until CPU usage goes back to normal. And there's another twist: going back to normal means ~10-15% instead of 0% or 1%. (That still seems high, given that without those publishes the manager idles at 0%, though that might be something else.)
  • Naively computing the rate at which we send out the accumulated memory gives 4.3MB/s (474MB / 110sec) - that does not look great.
  • Flamegraph from the ~110% CPU usage scenario - notably, caf.thread is 48% of all samples, with vector::erase() at 43.0%. (flamegraph image: manager; see the sketch after this list)
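
For context on why vector::erase() dominates: here is a minimal sketch (illustrative names, not CAF's actual code) of the buffered-publish pattern behind those samples. Delivered items get erased from the front of a std::vector, which shifts every remaining element, so draining N buffered items costs O(N²) element moves in total.

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a buffered publish queue.
struct vec_buffer {
  std::vector<std::string> buf_;

  // Deliver up to n items to consumers, then drop them.
  void shift(std::size_t n) {
    if (n > buf_.size())
      n = buf_.size();
    // ... hand items [0, n) to consumers here ...
    // Erasing at the front moves ALL remaining elements left,
    // i.e. O(buf_.size()) work per call -- the flamegraph hot spot.
    buf_.erase(buf_.begin(), buf_.begin() + n);
  }
};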

@bbannier and @timwoj suggested just replacing the vector with a deque. I can try to stare at this, but wonder if @Neverlord has other higher-level thoughts.

@load base/frameworks/cluster
@load base/frameworks/reporter

redef Broker::disable_ssl = T;

redef Reporter::info_to_stderr = T;
redef Reporter::warnings_to_stderr = T;
redef Reporter::errors_to_stderr = T;

global interface = "fuzzout";
global workers = 16;

event zeek_init()
    {
    if ( ! Supervisor::is_supervisor() )
        return;

    Broker::listen("127.0.0.1", 9999/tcp);

    local cluster: table[string] of Supervisor::ClusterEndpoint;
    cluster["manager"] = [$role=Supervisor::MANAGER, $host=127.0.0.1, $p=10000/tcp];
    cluster["logger"] = [$role=Supervisor::LOGGER, $host=127.0.0.1, $p=10001/tcp];
    # cluster["proxy"] = [$role=Supervisor::PROXY, $host=127.0.0.1, $p=10002/tcp];
    local i = 0;
    local worker_port_offset = 10010;
    while ( i < workers )
        {
        ++i;
        local name = fmt("worker-%03d", i);
        local p = count_to_port(worker_port_offset + i, tcp);
        cluster[name] = [$role=Supervisor::WORKER, $host=127.0.0.1, $p=p, $interface=interface];
        }

    for ( n, ep in cluster )
        {
        local sn = Supervisor::NodeConfig($name=n);
        sn$cluster = cluster;
        sn$directory = n;

        if ( ep?$interface )
            sn$interface = ep$interface;

        local res = Supervisor::create(sn);
        if ( res != "" )
            Reporter::fatal(fmt("supervisor failed to create node '%s': %s", n, res));
        }
    }

@if ( ! Supervisor::is_supervisor() )

@endif

@if ( Cluster::local_node_type() == Cluster::MANAGER )

@load base/frameworks/intel

# Uses internal Intel events.
module Intel;

# This fakes reading of an Intel file with c entries.
event make_intel(c: count)
	{
	local i = 50000;  # batch size
	print "make intel", c;

	while (i > 0 && c > 0)
		{
		--i;
		--c;
		local indicator = "192.168.0.1";
		local meta = Intel::MetaData($source="fake");
		local item = Intel::Item($indicator=indicator, $indicator_type=Intel::ADDR, $meta=meta);
		event Intel::new_item(item);
		}

	# Re-raise to produce the next batch, yielding to the event loop in between batches.
	if ( c > 0 )
		event make_intel(c);
	}

global last_ps = get_proc_stats();
event stats()
	{
	local ps = get_proc_stats();
	local real_elapsed = interval_to_double(ps$real_time - last_ps$real_time);
	local user_elapsed = interval_to_double(ps$user_time - last_ps$user_time);
	local user_perc = user_elapsed * 100.0 / real_elapsed;

	last_ps = ps;

	# Mem is a bit useless, because it only shows maxrss
	print fmt("mem mb %.3f", ps$mem / 1024.0 / 1024.0), "real", interval_to_double(ps$real_time), fmt("user cpu usage %.1f%%", user_perc), "user", interval_to_double(ps$user_time);
	schedule 1sec { stats() };
	}

global intel_scheduled = F;
event Cluster::node_up(name: string, id: string)
	{
	if ( intel_scheduled )
		return;

	schedule 0.1sec { stats() };
	print("Making intel in 4 seconds");
	schedule 4sec { make_intel(1000000) };
	intel_scheduled = T;
	}
@endif

  • After all intel items have been produced, the manager process runs at ~110% CPU (fully utilized caf.thread) for ~110sec until CPU usage goes back to normal. And there's another twist: going back to normal means ~10-15% instead of 0% or 1%. (That still seems high, given that without those publishes the manager idles at 0%, though that might be something else.)

Uhh, with Zeek 4.2 this takes "only" 30-35 seconds until CPU usage is back to normal. Peak memory usage is at ~900MB, though. So something got 3-4x slower here. 30-35 seconds still seems slow.

The following change in CAF, switching to std::deque, reduces the time reported above for current master from ~110sec to a bit under 30 seconds. Manager CPU usage after all intel items have been produced dropped from ~110% to ~60-70%.

diff --git a/libcaf_core/caf/flow/observable.hpp b/libcaf_core/caf/flow/observable.hpp
index 929dc1656..f64ac3d6d 100644
--- a/libcaf_core/caf/flow/observable.hpp
+++ b/libcaf_core/caf/flow/observable.hpp
@@ -621,12 +621,12 @@ public:
 
   explicit buffered_observable_impl(coordinator* ctx)
     : ctx_(ctx), desired_capacity_(defaults::flow::buffer_size) {
-    buf_.reserve(desired_capacity_);
+    // buf_.reserve(desired_capacity_);
   }
 
   buffered_observable_impl(coordinator* ctx, size_t desired_capacity)
     : ctx_(ctx), desired_capacity_(desired_capacity) {
-    buf_.reserve(desired_capacity_);
+    // buf_.reserve(desired_capacity_);
   }
 
   // -- implementation of disposable::impl -------------------------------------
@@ -802,7 +802,7 @@ protected:
 
   coordinator* ctx_;
   size_t desired_capacity_;
-  std::vector<T> buf_;
+  std::deque<T> buf_;
   bool completed_ = false;
   size_t max_demand_ = 0;
   std::vector<output_t> outputs_;
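
For comparison, a deque-backed version of the sketch from above (again illustrative, not CAF's actual code): removing delivered items from the front no longer touches the items that stay buffered, because a deque stores its elements in fixed-size chunks and can release front chunks without moving the rest.

#include <cstddef>
#include <deque>
#include <string>

// Hypothetical deque-backed variant of the buffer sketched earlier.
struct deque_buffer {
  std::deque<std::string> buf_;

  void shift(std::size_t n) {
    if (n > buf_.size())
      n = buf_.size();
    // ... hand items [0, n) to consumers here ...
    // Erasing at the front of a deque destroys the n removed
    // elements but assigns nothing: no shifting of the remainder.
    buf_.erase(buf_.begin(), buf_.begin() + n);
  }
};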

Flamegraph now looks as follows, indicating more work in the caf.net.mpx thread: (flamegraph image: manager-deque)

Thanks for doing my homework and not only filing an issue but providing a fix straightaway! 😄

I've found one more instance where deque makes more sense than vector. Mind double-checking it by setting your caf submodule to commit actor-framework/actor-framework@58b53ab?

Hey @Neverlord - I've finally looked again. I took latest master including the caf fix, compiled without ASAN and with -O3.

  • With 16 workers, it takes ~30 seconds for the manager to go back to idle after producing all 1 million intel items.

  • With the patch reverted and the same compile options, I measured ~90 seconds instead.

Do you think there might be any negative side effects of this change?

Do you think there might be any negative side effects of this change?

I don't think so. A deque could end up making more memory allocations than a vector. However, all testing indicates that the copying is eating up performance (not the allocations), so I'd go ahead with the changes.
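
For anyone wanting to sanity-check that copying-vs-allocations trade-off locally, a rough micro-benchmark sketch (illustrative only: it models just the draining side of the buffer, the batch size of 32 is made up, and absolute numbers will vary by platform):

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

// Drain `total` 64-byte strings in batches of `batch`, erasing
// delivered items from the front, and report the elapsed time.
template <class Container>
double drain_ms(std::size_t total, std::size_t batch) {
  Container buf(total, std::string(64, 'x'));
  auto start = std::chrono::steady_clock::now();
  while (!buf.empty()) {
    std::size_t n = std::min(batch, buf.size());
    buf.erase(buf.begin(), buf.begin() + n);
  }
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
  std::printf("vector: %.1f ms\n",
              drain_ms<std::vector<std::string>>(1000000, 32));
  std::printf("deque:  %.1f ms\n",
              drain_ms<std::deque<std::string>>(1000000, 32));
}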

@awelzel this should be solved since merging #284. Can we go ahead and close this ticket, or did you observe another issue since switching to deque?

I have not, but haven't specifically looked, either.