zeek / broker

Zeek's Messaging Library

Home Page: https://docs.zeek.org/projects/broker


vector::erase() performance bottleneck

awelzel opened this issue · comments

In zeek/zeek#2394, a performance bottleneck pointing at vector::erase() was observed. The following test reproduces it by starting a manager and 16 workers, then producing 1 million Intel items in quick succession to simulate reading an Intel file.

Starting the attached script with zeek -j, the following observations were made (edit: this was with Zeek version 5.1.0-dev.497):

  • Memory usage of the manager increases from ~100MB to 574MB, more or less depending on the number of workers. Dialing back to 4 workers, max memory usage is ~320MB.
  • After all intel items have been produced, the manager process runs at ~110% CPU (fully utilized caf.thread) for ~110sec until CPU usage goes back to normal. And there's another twist: going back to normal means ~10-15% instead of 0% or 1%. (That still seems high, given that without those publishes the manager idles at 0%, though that might be something else.)
  • Naively computing the rate at which we send out the accumulated memory gives 4.3MB/s (474MB / 110sec) - that does not look great.
  • Flamegraph from the ~110% CPU usage scenario - notably, caf.thread is 48% of all samples, with vector::erase() at 43.0%. (flamegraph image: manager; see the sketch after this list)
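
For context on why vector::erase() dominates: here is a minimal sketch (illustrative names, not CAF's actual code) of the buffered-publish pattern behind those samples. Delivered items get erased from the front of a std::vector, which shifts every remaining element, so draining N buffered items costs O(N²) element moves in total.

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a buffered publish queue.
struct vec_buffer {
  std::vector<std::string> buf_;

  // Deliver up to n items to consumers, then drop them.
  void shift(std::size_t n) {
    if (n > buf_.size())
      n = buf_.size();
    // ... hand items [0, n) to consumers here ...
    // Erasing at the front moves ALL remaining elements left,
    // i.e. O(buf_.size()) work per call -- the flamegraph hot spot.
    buf_.erase(buf_.begin(), buf_.begin() + n);
  }
};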

@bbannier and @timwoj suggested just replacing the vector with a deque. I can try to stare at this, but wonder if @Neverlord has other higher-level thoughts.

@load base/frameworks/cluster
@load base/frameworks/reporter

redef Broker::disable_ssl = T;

redef Reporter::info_to_stderr = T;
redef Reporter::warnings_to_stderr = T;
redef Reporter::errors_to_stderr = T;

global interface = "fuzzout";
global workers = 16;

event zeek_init()
    {
    if ( ! Supervisor::is_supervisor() )
        return;

    Broker::listen("127.0.0.1", 9999/tcp);

    local cluster: table[string] of Supervisor::ClusterEndpoint;
    cluster["manager"] = [$role=Supervisor::MANAGER, $host=127.0.0.1, $p=10000/tcp];
    cluster["logger"] = [$role=Supervisor::LOGGER, $host=127.0.0.1, $p=10001/tcp];
    # cluster["proxy"] = [$role=Supervisor::PROXY, $host=127.0.0.1, $p=10002/tcp];
    local i = 0;
    local worker_port_offset = 10010;
    while ( i < workers )
        {
        ++i;
        local name = fmt("worker-%03d", i);
        local p = count_to_port(worker_port_offset + i, tcp);
        cluster[name] = [$role=Supervisor::WORKER, $host=127.0.0.1, $p=p, $interface=interface];
        }

    for ( n, ep in cluster )
        {
        local sn = Supervisor::NodeConfig($name=n);
        sn$cluster = cluster;
        sn$directory = n;

        if ( ep?$interface )
            sn$interface = ep$interface;

        local res = Supervisor::create(sn);
        if ( res != "" )
            Reporter::fatal(fmt("supervisor failed to create node '%s': %s", n, res));
        }
    }

@if ( ! Supervisor::is_supervisor() )

@endif

@if ( Cluster::local_node_type() == Cluster::MANAGER )

@load base/frameworks/intel

# Uses internal Intel events.
module Intel;

# This fakes reading of an Intel file with c entries.
event make_intel(c: count)
	{
	local i = 50000;  # batch size
	print "make intel", c;

	while (i > 0 && c > 0)
		{
		--i;
		--c;
		local indicator = "192.168.0.1";
		local meta = Intel::MetaData($source="fake");
		local item = Intel::Item($indicator=indicator, $indicator_type=Intel::ADDR, $meta=meta);
		event Intel::new_item(item);
		}

	# Re-raise to produce the next batch, yielding to the event loop in between batches.
	if ( c > 0 )
		event make_intel(c);
	}

global last_ps = get_proc_stats();
event stats()
	{
	local ps = get_proc_stats();
	local real_elapsed = interval_to_double(ps$real_time - last_ps$real_time);
	local user_elapsed = interval_to_double(ps$user_time - last_ps$user_time);
	local user_perc = user_elapsed * 100.0 / real_elapsed;

	last_ps = ps;

	# Mem is a bit useless, because it only shows maxrss
	print fmt("mem mb %.3f", ps$mem / 1024.0 / 1024.0), "real", interval_to_double(ps$real_time), fmt("user cpu usage %.1f%%", user_perc), "user", interval_to_double(ps$user_time);
	schedule 1sec { stats() };
	}

global intel_scheduled = F;
event Cluster::node_up(name: string, id: string)
	{
	if ( intel_scheduled )
		return;

	schedule 0.1sec { stats() };
	print("Making intel in 4 seconds");
	schedule 4sec { make_intel(1000000) };
	intel_scheduled = T;
	}
@endif

  • After all intel items have been produced, the manager process runs at ~110% CPU (fully utilized caf.thread) for ~110sec until CPU usage goes back to normal. And there's another twist: going back to normal means ~10-15% instead of 0% or 1%. (That still seems high, given that without those publishes the manager idles at 0%, though that might be something else.)

Uhh, with Zeek 4.2 this takes "only" 30-35 seconds until CPU usage is back to normal. Peak memory usage is at ~900MB, though. So something got 3-4x slower here. 30-35 seconds still seems slow.

The following change in CAF, switching to std::deque, reduces the time reported above for current master from ~110sec to a bit under 30 seconds. Manager CPU usage after all intel items have been produced dropped from ~110% to ~60-70%.

diff --git a/libcaf_core/caf/flow/observable.hpp b/libcaf_core/caf/flow/observable.hpp
index 929dc1656..f64ac3d6d 100644
--- a/libcaf_core/caf/flow/observable.hpp
+++ b/libcaf_core/caf/flow/observable.hpp
@@ -621,12 +621,12 @@ public:
 
   explicit buffered_observable_impl(coordinator* ctx)
     : ctx_(ctx), desired_capacity_(defaults::flow::buffer_size) {
-    buf_.reserve(desired_capacity_);
+    // buf_.reserve(desired_capacity_);
   }
 
   buffered_observable_impl(coordinator* ctx, size_t desired_capacity)
     : ctx_(ctx), desired_capacity_(desired_capacity) {
-    buf_.reserve(desired_capacity_);
+    // buf_.reserve(desired_capacity_);
   }
 
   // -- implementation of disposable::impl -------------------------------------
@@ -802,7 +802,7 @@ protected:
 
   coordinator* ctx_;
   size_t desired_capacity_;
-  std::vector<T> buf_;
+  std::deque<T> buf_;
   bool completed_ = false;
   size_t max_demand_ = 0;
   std::vector<output_t> outputs_;
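
For comparison, a deque-backed version of the sketch from above (again illustrative, not CAF's actual code): removing delivered items from the front no longer touches the items that stay buffered, because a deque stores its elements in fixed-size chunks and can release front chunks without moving the rest.

#include <cstddef>
#include <deque>
#include <string>

// Hypothetical deque-backed variant of the buffer sketched earlier.
struct deque_buffer {
  std::deque<std::string> buf_;

  void shift(std::size_t n) {
    if (n > buf_.size())
      n = buf_.size();
    // ... hand items [0, n) to consumers here ...
    // Erasing at the front of a deque destroys the n removed
    // elements but assigns nothing: no shifting of the remainder.
    buf_.erase(buf_.begin(), buf_.begin() + n);
  }
};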

Flamegraph now looks as follows, indicating more work in the caf.net.mpx thread: (flamegraph image: manager-deque)

Thanks for doing my homework and not only filing an issue but providing a fix straightaway! 😄

I've found one more instance where deque makes more sense than vector. Mind double-checking it by setting your caf submodule to commit actor-framework/actor-framework@58b53ab?

Hey @Neverlord - I've finally looked again. I took latest master including the caf fix, compiled without ASAN and with -O3.

  • With 16 workers, it takes ~30 seconds for the manager to go back to idle after producing all 1 million intel items.

  • With the patch reverted and the same compile options, I measured ~90 seconds instead.

Do you think there might be any negative side effects of this change?

Do you think there might be any negative side effects of this change?

I don't think so. A deque could end up making more memory allocations than a vector. However, all testing indicates that the copying is eating up performance (not the allocations), so I'd go ahead with the changes.
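
For anyone wanting to sanity-check that copying-vs-allocations trade-off locally, a rough micro-benchmark sketch (illustrative only: it models just the draining side of the buffer, the batch size of 32 is made up, and absolute numbers will vary by platform):

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

// Drain `total` 64-byte strings in batches of `batch`, erasing
// delivered items from the front, and report the elapsed time.
template <class Container>
double drain_ms(std::size_t total, std::size_t batch) {
  Container buf(total, std::string(64, 'x'));
  auto start = std::chrono::steady_clock::now();
  while (!buf.empty()) {
    std::size_t n = std::min(batch, buf.size());
    buf.erase(buf.begin(), buf.begin() + n);
  }
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
  std::printf("vector: %.1f ms\n",
              drain_ms<std::vector<std::string>>(1000000, 32));
  std::printf("deque:  %.1f ms\n",
              drain_ms<std::deque<std::string>>(1000000, 32));
}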

@awelzel this should be solved since merging #284. Can we go ahead and close this ticket, or did you observe another issue since switching to deque?

I have not, but haven't specifically looked, either.