zeek / broker

Zeek's Messaging Library

Home Page: https://docs.zeek.org/projects/broker

Metrics scalability performance issue in 6.0

awelzel opened this issue · comments

This has been reported during Zeek 6.0 rc1/rc2 testing as manager memory growing steadily until crashing due to OOM.

Further, while the manager's memory usage is growing, it no longer serves requests to curl http://localhost:9911/metrics. The connection is accepted, but no response is produced.

The following supervisor setup reproduces the issue here. It creates two counter families with 200 counter instances each. With 8 workers and 3 other processes (2 families × 200 instances on each of 11 endpoints, i.e. roughly 4,400 test-counter instances), this results in enough metrics to cause manager overload.

Increasing nworkers, ncounters_per_family, or nfamilies should reproduce the issue if you have a more powerful system.

Deployments with 64 or 128 workers may trigger this even if the number of instances per metric is small (10 or 20), as it is multiplied by the number of endpoints if I read the code right, let alone setups with 500+ workers (#352).

On the Zeek side we could consider removing the metrics for log streams/writers and/or event invocations for 6.0, or reducing the synchronization interval, but this seems mostly like a performance bug.

Further down the road, it may also be more efficient to query workers on demand rather than having workers publish their metrics every second (which most of the time will just be overwritten again), at the expense of a small delay.

$ cat my-supervisor.zeek
@load base/frameworks/cluster
@load base/frameworks/reporter

redef LogAscii::use_json = T;
redef Broker::disable_ssl = F;


redef Reporter::info_to_stderr = T;
redef Reporter::warnings_to_stderr = T;
redef Reporter::errors_to_stderr = T;

global nworkers = 8;

event zeek_init()
        {
        # print Cluster::local_node_type();
        if ( ! Supervisor::is_supervisor() )
                return;

        Broker::listen("127.0.0.1", 9999/tcp);

        local cluster: table[string] of Supervisor::ClusterEndpoint;
        cluster["manager"] = [$role=Supervisor::MANAGER, $host=127.0.0.1, $p=10000/tcp];
        cluster["proxy"] = [$role=Supervisor::PROXY, $host=127.0.0.1, $p=10001/tcp];
        cluster["logger"] = [$role=Supervisor::LOGGER, $host=127.0.0.1, $p=10002/tcp];

        local worker_port_offset = 10100;
        local i = 0;
        while ( i < nworkers )
                {
                ++i;
                local name = fmt("worker-%03d", i);
                cluster[name] = [$role=Supervisor::WORKER, $host=127.0.0.1, $p=0/tcp, $interface="lo"];
                }

        for ( n, ep in cluster )
                {
                local sn = Supervisor::NodeConfig($name=n);
                sn$cluster = cluster;
                sn$directory = n;
                sn$env = table(["ZEEK_DEFAULT_CONNECT_RETRY"] = "1");

                if ( ep?$interface )
                        sn$interface = ep$interface;

                print "starting",  sn$name;
                local res = Supervisor::create(sn);
                if ( res != "" )
                        print fmt("supervisor failed to create node '%s': %s", sn, res);
                }
        }

@if ( ! Supervisor::is_supervisor() )
@load ./telemetry.zeek
@endif


$ cat telemetry.zeek 
redef Broker::disable_ssl = T;
global update_interval: interval = 1sec;

global nfamilies = 2;
global ncounters_per_family = 200;

type Counters: record {
  f: Telemetry::CounterFamily;
  counters: vector of Telemetry::Counter;
};


global my_families: vector of Counters;
global counters: vector of Telemetry::Counter;

event update_telemetry() {
        schedule update_interval { update_telemetry() };

        for ( _, f in my_families ) {
                for ( _, c in f$counters ) {
                        Telemetry::counter_inc(c, rand(10));
                }
        }
}

event zeek_init() {
        local i = 0;
        while ( i < nfamilies ) {
                local f = Counters(
                        $f=Telemetry::register_counter_family([
                                $prefix="zeek",
                                $name="test",
                                $unit="stuff",
                                $help_text=fmt("stuff %d", i),
                                $labels=vector("label1", "label2"),
                        ]),
                        $counters=vector(),
                );
                my_families[i] = f;
                local j = 0;
                while ( j < ncounters_per_family ) {
                        local labels = vector(cat(i), cat(j));
                        f$counters += Telemetry::counter_with(f$f, labels);
                        ++j;
                }
                ++i;
        }

        schedule update_interval { update_telemetry() };

}

[Image: 8-workers-supervisor]

> On the Zeek side we could consider removing the metrics for log streams/writers and/or event invocations for 6.0, or reducing the synchronization interval, but this seems mostly like a performance bug.

Agreed; I'm relieved because we have so many options here. The find_if() looks expensive because it's implementing an identity comparison for a set of labels (yes?), so we should first see if we can optimize that. My next suggestion would be to dial down the synchronization interval (and make it configurable, if possible), and as a last resort remove the new metrics. Thoughts on this welcome, of course.

I also like the idea of implementing proper request-driven scraping in 6.1 instead of constant push from all nodes to the manager.

@Neverlord, it would be great to hear your thoughts here: could metric_scope gain a map pointing directly from labels to the matching instance, so we can drop the vector scan? Or could instances become a map itself?

I'll look into it.

> I also like the idea of implementing proper request-driven scraping in 6.1 instead of constant push from all nodes to the manager.

I'm not convinced that this is a good route for a distributed system with loose coupling like Broker. At the pub/sub layer, we don't know how many responses we should expect. There is no central registry of metric sources. Even if there was one, we would still have to guard against all sorts of partial errors, ultimately with some sort of timeout for the operation. A loosely coupled push model like we have now is much more robust.

If the once-per-second updates introduce significant traffic, I think we can instead optimize that update. I haven't looked at the actual metrics yet, but are all workers actually using all the metrics? Maybe we could skip emitting metrics with zero values or use some sort of "compression" / better encoding.

But let's fix the obvious performance bugs first. 🙂

> Agreed; I'm relieved because we have so many options here. The find_if() looks expensive because it's implementing an identity comparison for a set of labels (yes?), so we should first see if we can optimize that. My next suggestion would be to dial down the synchronization interval (and make it configurable, if possible), and as a last resort remove the new metrics.
> ...
> Or could instances become a map itself?

Full ACK on the strategy. 👍

PR #367 is getting rid of the find_if and treats instances like a map (via std::lower_bound lookups). Since the vector is just a list of pointers, std::lower_bound on a sorted vector should be even faster than an actual std::map (because the map has to do pointer chasing on a tree). Let's see if this improves performance enough.

The interval is already configurable via the broker.metrics.export.interval option (BROKER_METRICS_EXPORT_INTERVAL as an environment variable, Broker::metrics_export_interval from Zeek scripts), so we can fine-tune that as well if necessary.
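
For example, the export could be slowed down from a Zeek script like this (a minimal sketch; the option name is taken from above, while 5sec is an arbitrary value, not a recommendation):

# Push metrics to the manager less often to cut synchronization traffic.
# The same knob is reachable via the BROKER_METRICS_EXPORT_INTERVAL
# environment variable mentioned above.
redef Broker::metrics_export_interval = 5sec;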

> I'm not convinced that this is a good route for a distributed system with loose coupling like Broker. At the pub/sub layer, we don't know how many responses we should expect.

On the Zeek level, the nodes we expect metrics from are fixed. All nodes should also have consistent metrics (types, help, labels). In fact, in much larger setups it might be best to forego the centralization aspect of either the push-based or the pull-based approach altogether and use configuration management to set up Prometheus scraping of individual nodes accordingly.
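
As an illustration of that per-node setup, here is a minimal sketch on the Zeek side. It assumes the Broker::metrics_port option that backs the existing /metrics endpoint; the METRICS_PORT environment variable is an invented name that configuration management would set to a distinct value for each node, so Prometheus can scrape every node directly:

# Hypothetical per-node metrics exposure: each node gets its own port via an
# environment variable assigned by configuration management, instead of
# funneling everything through the manager.
@if ( getenv("METRICS_PORT") != "" )
redef Broker::metrics_port = to_port(cat(getenv("METRICS_PORT"), "/tcp"));
@endif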

> Even if there was one, we would still have to guard against all sorts of partial errors, ultimately with some sort of timeout for the operation.

That seems quite okay and pragmatic. If a node fails to provide metrics within 1 or 2 seconds, then it timed out, and that's a signal, too.


I have prototyped the request-response/pull-based approach here: https://github.com/awelzel/zeek-js-metrics

This triggers metric collection as Zeek events over broker and collects the results (pre-rendered Prometheus lines) before replying to an HTTP request handled in JavaScript on the manager.

With 24 workers and a large number of metrics, there is zero overhead or extra cluster communication when no scraping happens, and significantly lower overhead on the manager when scraping happens at 1-second intervals (still high, but comparable to broker's default). In this artificial setup, broker's centralization causes the manager to use 30% CPU by default; with the pull approach, usage is only at 10%.

This doesn't necessarily mean we should require JS for this, but I think it's reasonable to use it to compare approaches and expectations.
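
For reference, here is a rough sketch of the pull model in plain Zeek script, independent of JS. This is not the zeek-js-metrics code: the event names, the render_local_metrics() placeholder, and the periodic do_scrape() trigger are all made up for illustration, and timeout/partial-failure handling is omitted.

# Rough sketch of pull-based collection: the manager broadcasts a request and
# every worker answers with its locally rendered metrics. All names here are
# illustrative; the real prototype drives this from an HTTP handler instead.
@load base/frameworks/cluster

global metrics_request: event(request_id: string);
global metrics_response: event(request_id: string, node: string, rendered: string);

# Placeholder for whatever renders the local metrics as Prometheus text.
function render_local_metrics(): string
        {
        return fmt("zeek_test_total{node=\"%s\"} 1\n", Cluster::node);
        }

@if ( Cluster::local_node_type() == Cluster::WORKER )
event metrics_request(request_id: string)
        {
        Broker::publish(Cluster::manager_topic, metrics_response, request_id,
                        Cluster::node, render_local_metrics());
        }
@endif

@if ( Cluster::local_node_type() == Cluster::MANAGER )
event metrics_response(request_id: string, node: string, rendered: string)
        {
        # The prototype buffers responses per request_id and answers the
        # pending HTTP scrape once all known nodes replied or a short
        # timeout fired; here we just print what arrived.
        print fmt("%s: %d bytes of metrics from %s", request_id, |rendered|, node);
        }

event do_scrape()
        {
        # In the prototype this is triggered by an incoming /metrics request;
        # here it is fired once after startup for illustration.
        Broker::publish(Cluster::worker_topic, metrics_request, unique_id("scrape-"));
        }

event zeek_init()
        {
        schedule 5sec { do_scrape() };
        }
@endif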

> Even if there was one, we would still have to guard against all sorts of partial errors, ultimately with some sort of timeout for the operation.

> That seems quite okay and pragmatic. If a node fails to provide metrics within 1 or 2 seconds, then it timed out, and that's a signal, too.

I disagree with this. Very much.

By opening up the system via WebSocket and ultimately via ALM, we are no longer limited to the rigid structure of a Zeek cluster. To me, this is not a pragmatic solution. On the contrary, it would tie down Broker and directly contradict many design decisions and idioms for a loosely coupled, distributed system.

It also brings issues of its own. Tying up a scraper for up to 1-2 s because a single node is lagging behind is unacceptable. The scraper would probably time out at that point, ultimately ending up with no metrics at all instead of at least the n-1 it could have had.

I very much appreciate your efforts in quantifying the problem. But please let's not commit to problematic workarounds that compromise the system architecture and come with their own bag of problems. Let's fix the actual problem here: poor performance in Broker. This is purely a performance bug. Currently, the metrics are heavily nested. Broker is really, really bad at efficiently handling this (unfortunately). #368 could be a big part of a solution here, to make this simply a non-issue. Pre-rendering the metrics instead of shipping "neat" broker::data is also an option.

The central collection of metrics was something I put together a while ago after some internal discussion that this would be nice to have. Then it basically didn't get used until you tried it after adding hundreds of metrics to Zeek. My design assumed maybe a couple dozen metrics per node. Let me fix that. 🙂

We have already disabled central collection by default again, right? Is this still something we would consider urgent?

> We have already disabled central collection by default again, right?

Yes, it's disabled.

I have been observing this discussion from the sidelines; just a few comments.

> Even if there was one, we would still have to guard against all sorts of partial errors, ultimately with some sort of timeout for the operation.

> That seems quite okay and pragmatic. If a node fails to provide metrics within 1 or 2 seconds, then it timed out, and that's a signal, too.

> I disagree with this. Very much.

> By opening up the system via WebSocket and ultimately via ALM, we are no longer limited to the rigid structure of a Zeek cluster. To me, this is not a pragmatic solution. On the contrary, it would tie down Broker and directly contradict many design decisions and idioms for a loosely coupled, distributed system.

For me, reading metrics from the manager is a convenience feature for users who do not want to set up proper scraping of individual, decoupled components. Re: your comments on dynamic cluster layouts, we should have discoverability tooling for working with such clusters anyway, so there should (eventually?) be a way to set up such scraping; if anything is missing here, I'd focus on that.

The current implementation has some issues:

  • The cluster always performs work whose results potentially nobody collects. Making optimal use of this feature would require exactly synchronizing the workers' push interval with the collector's scrape interval on the manager. This seems extremely rigid and will very likely not be used in the optimal way.
  • Ultimately it runs into scalability issues as the number of nodes and/or metrics grows. Both of these are controlled by users, not us. No matter how much we optimize the code, we can always come up with a configuration that will bog down the manager.
  • Users need to fine-tune the update interval for their setup. There is no good automatic (or even dynamic) way to pick a "good" value that simultaneously gives good granularity and minimizes overhead.
  • More generally, implementing backpressure in a push architecture requires a non-trivial protocol and makes interpreting metrics results harder.

> It also brings issues of its own. Tying up a scraper for up to 1-2 s because a single node is lagging behind is unacceptable. The scraper would probably time out at that point, ultimately ending up with no metrics at all instead of at least the n-1 it could have had.

I'd argue that if this is a concern for users, they should scrape individual nodes.