neondatabase / autoscaling

Postgres vertical autoscaling in k8s

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Make metrics LFC number of hits, number of misses and working set size available to autoscaling algorithm

stradig opened this issue · comments

We want to be able to experiment with the algorithm to see which of those values can improve performance for autoscaled computes.

Tasks

  1. c/compute t/feature
    knizhnik

Need to investigate how to export data using SQL statements. This does not seem to be supported by vector.dev.

IIRC the existing metrics are exposed by sql-exporter — I think vector could just pull from there, if we want to expose it via vector.

So we have 4 possible ways to go forwad:

  1. Fetch from vector (#878)
    • Disadvantage: adds an additional delay between sql-exporter and vector
  2. Fetch from sql-exporter (#895)
    • Disadvantage: sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s
  3. Fetch from vm-monitor (neondatabase/neon#7302 (comment))
    • Disadvantage: one more place to implement working with metrics
  4. Fetch directly from postgres
    • Disadvantage: breaks abstraction layers, needs somehow to put credentials into the autoscaler-agent

@skyzh @sharnoff sounds correct? Which ones do you prefer?

My thoughts — I want to avoid adding tech debt by linking together components that weren't previously linked.

  1. Fetch from vector — modifies vector here to support sql-exporter in neondatabase/neon, adding a new link. Also has the downside of repeating metric values because the autoscaler-agent fetch frequency would be greater than vector's refresh frequency.
  2. Fetch from sql-exporter — mostly doesn't add a new link beyond what's required for this issue; the autoscaler-agent already fetches prometheus metrics from the VM. That's why I went with this approach.
  3. Fetch from vm-monitor — adds a new responsibility to vm-monitor, and would also require additional support in the autoscaler-agent. All work done on the autoscaler-agent <-> vm-monitor protocol should be approached with hazmat suits for now. It does what we need it to, but it needs a lot of work, and I'm hesitant to add more responsibilities to it until after some refactoring has taken place.
  4. Fetch directly from postgres — adds a new link between autoscaler-agent and postgres, like you said @Omrigan. And yeah, credentials would be quite tricky, requiring help from other components we don't currently rely on.

re:

sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s

The current state of #895 is to have a configurable port and frequency — we can fetch as slow as we need to. For the ext-metrics datasources, we already do query every 15s (or maybe even more frequently?). Once a secondary sql-exporter is added with just the cheap metrics, we can e.g. add support for gradual rollout of fetching from a different port, faster, eventually switching everything over once old VMs restart.

@skyzh Can you share your opinion on options 2 vs 3?

If we want to have a second sql-exporter, I'm fine with either option 2 or 3. Otherwise, there needs to be a place to fetch these metrics, and it is easier to happen in vm-monitor.

...to be specific, I assume that autoscaling agent will at some point scrape these data at a high frequency, and I don't want these SQLs to be executed when we scrape sql-exporter:

https://github.com/neondatabase/neon/blob/2d5a8462c8093fb7db7e15cea68c6d740818c39c/vm-image-spec.yaml#L161-L188

Therefore, I'm proposing not go into the normal metrics sql-exporter for autoscaling metrics.

Discussed briefly with @skyzh — tl;dr:

  • Medium-term, we want to avoid having the autoscaler-agent pull LFC metrics from the main sql-exporter
  • Short-term:
    1. We can have the autoscaler-agent pull metrics from the existing sql-exporter, just with a low frequency so we don't overload postgres
    2. We can set up a second sql-exporter to just report LFC metrics
  • Then, we can have control plane set some annotation on new VMs to tell the autoscaler-agent to fetch LFC metrics with a higher frequency from the new port — giving the desired end state while retaining support for older VMs.

Status:

  • #895 is ready to merge, just was waiting to avoid interfering with patch release
  • We found out the new metrics weren't exposed. PR to fix is neondatabase/cloud#14245
  • Remaining work after that is actually using the metrics (design + implementation of new scaling algorithm, maybe?)