Make LFC metrics (number of hits, number of misses, and working set size) available to the autoscaling algorithm
stradig opened this issue
We want to be able to experiment with the algorithm to see which of those values can improve performance for autoscaled computes.
Need to investigate how to export data using SQL statements. This does not seem to be supported by vector.dev.
IIRC the existing metrics are exposed by sql-exporter — I think vector could just pull from there, if we want to expose it via vector.
yep, I found https://vector.dev/docs/reference/configuration/sources/prometheus_scrape/ that directly scrapes exporter data.
So we have 4 possible ways to go forward:
- Fetch from vector (#878)
  - Disadvantage: adds an additional delay between sql-exporter and vector
- Fetch from sql-exporter (#895)
  - Disadvantage: sql-exporter fetches a number of things, and it might overload the database if we fetch it every 5-15s
- Fetch from vm-monitor (neondatabase/neon#7302 (comment))
  - Disadvantage: one more place to implement working with metrics
- Fetch directly from postgres
  - Disadvantage: breaks abstraction layers, and we'd somehow need to put credentials into the autoscaler-agent
My thoughts — I want to avoid adding tech debt by linking together components that weren't previously linked.
- Fetch from vector — modifies vector here to support sql-exporter in neondatabase/neon, adding a new link. Also has the downside of repeating metric values, because the autoscaler-agent fetch frequency would be greater than vector's refresh frequency.
- Fetch from sql-exporter — mostly doesn't add a new link beyond what's required for this issue; the autoscaler-agent already fetches prometheus metrics from the VM. That's why I went with this approach (see the sketch after this list).
- Fetch from vm-monitor — adds a new responsibility to vm-monitor, and would also require additional support in the autoscaler-agent. All work done on the autoscaler-agent <-> vm-monitor protocol should be approached with hazmat suits for now. It does what we need it to, but it needs a lot of work, and I'm hesitant to add more responsibilities to it until after some refactoring has taken place.
- Fetch directly from postgres — adds a new link between autoscaler-agent and postgres, like you said @Omrigan. And yeah, credentials would be quite tricky, requiring help from other components we don't currently rely on.
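For concreteness, here's a minimal sketch of what "fetch from sql-exporter" could look like from the autoscaler-agent side, assuming a prometheus text endpoint on the VM. The address, port, and metric names (`lfc_hits`, `lfc_misses`, `lfc_approximate_working_set_size`) are placeholders for illustration, not the actual names from the sql-exporter config:

```go
// Minimal sketch: scrape the VM's sql-exporter endpoint and pull out the
// LFC metric families. Endpoint and metric names are placeholders.
package main

import (
	"fmt"
	"net/http"
	"time"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// scrapeMetrics fetches the prometheus text exposition from the given
// endpoint and parses it into metric families.
func scrapeMetrics(endpoint string) (map[string]*dto.MetricFamily, error) {
	client := http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(endpoint)
	if err != nil {
		return nil, fmt.Errorf("scraping %s: %w", endpoint, err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	return parser.TextToMetricFamilies(resp.Body)
}

func main() {
	// Placeholder address; the real port is whatever sql-exporter listens on.
	families, err := scrapeMetrics("http://vm-ip:9399/metrics")
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"lfc_hits", "lfc_misses", "lfc_approximate_working_set_size"} {
		if mf, ok := families[name]; ok && len(mf.Metric) > 0 {
			// Assuming these are exposed as gauges; adjust if they're counters.
			fmt.Printf("%s = %v\n", name, mf.Metric[0].GetGauge().GetValue())
		}
	}
}
```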
re:

> sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s
The current state of #895 is to have a configurable port and frequency — we can fetch as slowly as we need to. For the ext-metrics datasources, we already query every 15s (or maybe even more frequently?). Once a secondary sql-exporter is added with just the cheap metrics, we can e.g. add support for gradually rolling out faster fetching from a different port, eventually switching everything over once old VMs restart.
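To illustrate the shape of that config (not the actual #895 code; the field names and types here are invented):

```go
// Sketch of the per-VM metrics-source config described above. Names are
// invented for illustration and don't match #895 exactly.
package config

import "time"

// MetricsSourceConfig controls where and how often the autoscaler-agent
// scrapes LFC metrics from a VM.
type MetricsSourceConfig struct {
	// Port the exporter listens on inside the VM. Old VMs keep the main
	// sql-exporter's port; newer VMs could point this at a second exporter
	// that serves only the cheap LFC metrics.
	Port uint16 `json:"port"`
	// ScrapeInterval stays large while we're hitting the main sql-exporter
	// (to avoid loading postgres) and can shrink once the dedicated
	// exporter is in place.
	ScrapeInterval time.Duration `json:"scrapeInterval"`
}
```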
@skyzh Can you share your opinion on options 2 vs 3?
If we want to have a second sql-exporter, I'm fine with either option 2 or 3. Otherwise, there needs to be a place to fetch these metrics from, and that's easier to do in vm-monitor.
...to be specific, I assume the autoscaler-agent will at some point scrape this data at a high frequency, and I don't want those SQL queries to be executed on every sql-exporter scrape.
Therefore, I'm proposing that the autoscaling metrics not go through the normal metrics sql-exporter.
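For context, the LFC values themselves come from a couple of cheap queries; the concern is that scraping the main sql-exporter frequently runs everything else it collects too. A sketch of what the dedicated queries might look like, assuming the neon extension exposes `neon_get_lfc_stats()` and `approximate_working_set_size()` — both names are assumptions about the extension's interface and should be verified against neondatabase/neon:

```go
// Illustrative only: assumes the neon extension exposes neon_get_lfc_stats()
// (key/value pairs such as "file_cache_hits" / "file_cache_misses") and an
// approximate_working_set_size(reset bool) estimator. Verify the real
// interface before relying on these names.
package lfcmetrics

import (
	"database/sql"

	_ "github.com/lib/pq" // postgres driver
)

// readLFCStats runs the (cheap) LFC queries and returns the raw values.
func readLFCStats(db *sql.DB) (map[string]int64, error) {
	stats := make(map[string]int64)

	// Hit/miss counters and friends, as key/value pairs.
	rows, err := db.Query(`SELECT lfc_key, lfc_value FROM neon_get_lfc_stats()`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	for rows.Next() {
		var (
			key   string
			value int64
		)
		if err := rows.Scan(&key, &value); err != nil {
			return nil, err
		}
		stats[key] = value
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}

	// Working set size estimate (false = don't reset the estimator).
	var wss int64
	if err := db.QueryRow(`SELECT approximate_working_set_size(false)`).Scan(&wss); err != nil {
		return nil, err
	}
	stats["approximate_working_set_size"] = wss
	return stats, nil
}
```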
Discussed briefly with @skyzh — tl;dr:
- Medium-term, we want to avoid having the autoscaler-agent pull LFC metrics from the main sql-exporter
- Short-term:
- We can have the autoscaler-agent pull metrics from the existing sql-exporter, just with a low frequency so we don't overload postgres
- We can set up a second sql-exporter to just report LFC metrics
- Then, we can have the control plane set an annotation on new VMs telling the autoscaler-agent to fetch LFC metrics at a higher frequency from the new port — giving the desired end state while retaining support for older VMs (sketched below).
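A sketch of that selection logic, with the annotation key, ports, and intervals all invented for illustration:

```go
// Sketch of the annotation-based rollout described above. The annotation key,
// ports, and intervals are invented, not the real control plane contract.
package agent

import (
	"strconv"
	"time"
)

// Hypothetical annotation the control plane would set on new VMs.
const lfcPortAnnotation = "autoscaling.neon.tech/lfc-metrics-port"

type scrapeTarget struct {
	Port     uint16
	Interval time.Duration
}

// chooseLFCTarget decides where and how fast to scrape LFC metrics for a VM.
// Annotated (new) VMs get the dedicated exporter at a higher frequency; older
// VMs fall back to the main sql-exporter, scraped slowly to avoid loading
// postgres, until they restart onto new images.
func chooseLFCTarget(annotations map[string]string) scrapeTarget {
	if raw, ok := annotations[lfcPortAnnotation]; ok {
		if port, err := strconv.ParseUint(raw, 10, 16); err == nil {
			return scrapeTarget{Port: uint16(port), Interval: 5 * time.Second}
		}
	}
	return scrapeTarget{Port: 9399, Interval: time.Minute} // placeholder port
}
```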
Status:
- #895 is ready to merge; it was just waiting to avoid interfering with a patch release
- We found out the new metrics weren't exposed. PR to fix is neondatabase/cloud#14245
- Remaining work after that is actually using the metrics (design + implementation of a new scaling algorithm, maybe?)