Make LFC metrics (number of hits, number of misses, and working set size) available to the autoscaling algorithm
stradig opened this issue
We want to be able to experiment with the algorithm to see which of those values can improve performance for autoscaled computes.
Need to investigate how to export data using SQL statements. This does not seem to be supported by vector.dev.
IIRC the existing metrics are exposed by sql-exporter — I think vector could just pull from there, if we want to expose it via vector.
yep, I found https://vector.dev/docs/reference/configuration/sources/prometheus_scrape/ that directly scrapes exporter data.
So we have 4 possible ways to go forward:
- Fetch from vector (#878)
  - Disadvantage: adds an additional delay between sql-exporter and vector
- Fetch from sql-exporter (#895)
  - Disadvantage: sql-exporter fetches a number of things, and it might overload the database if we fetch it every 5-15s
- Fetch from vm-monitor (neondatabase/neon#7302 (comment))
  - Disadvantage: one more place to implement working with metrics
- Fetch directly from postgres
  - Disadvantage: breaks abstraction layers, and we'd somehow need to put credentials into the autoscaler-agent
My thoughts — I want to avoid adding tech debt by linking together components that weren't previously linked.
- Fetch from vector — modifies vector here to support sql-exporter in neondatabase/neon, adding a new link. Also has the downside of repeating metric values, because the autoscaler-agent fetch frequency would be greater than vector's refresh frequency.
- Fetch from sql-exporter — mostly doesn't add a new link beyond what's required for this issue; the autoscaler-agent already fetches prometheus metrics from the VM. That's why I went with this approach (see the sketch after this list).
- Fetch from vm-monitor — adds a new responsibility to vm-monitor, and would also require additional support in the autoscaler-agent. All work done on the autoscaler-agent <-> vm-monitor protocol should be approached with hazmat suits for now. It does what we need it to, but it needs a lot of work, and I'm hesitant to add more responsibilities to it until after some refactoring has taken place.
- Fetch directly from postgres — adds a new link between autoscaler-agent and postgres, like you said @Omrigan. And yeah, credentials would be quite tricky, requiring help from other components we don't currently rely on.
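For concreteness, here's a minimal sketch of what "fetch from sql-exporter" could look like from the autoscaler-agent side, assuming a prometheus text endpoint on the VM. The address, port, and metric names (`lfc_hits`, `lfc_misses`, `lfc_approximate_working_set_size`) are placeholders for illustration, not the actual names from the sql-exporter config:

```go
// Minimal sketch: scrape the VM's sql-exporter endpoint and pull out the
// LFC metric families. Endpoint and metric names are placeholders.
package main

import (
	"fmt"
	"net/http"
	"time"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// scrapeMetrics fetches the prometheus text exposition from the given
// endpoint and parses it into metric families.
func scrapeMetrics(endpoint string) (map[string]*dto.MetricFamily, error) {
	client := http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(endpoint)
	if err != nil {
		return nil, fmt.Errorf("scraping %s: %w", endpoint, err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	return parser.TextToMetricFamilies(resp.Body)
}

func main() {
	// Placeholder address; the real port is whatever sql-exporter listens on.
	families, err := scrapeMetrics("http://vm-ip:9399/metrics")
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"lfc_hits", "lfc_misses", "lfc_approximate_working_set_size"} {
		if mf, ok := families[name]; ok && len(mf.Metric) > 0 {
			// Assuming these are exposed as gauges; adjust if they're counters.
			fmt.Printf("%s = %v\n", name, mf.Metric[0].GetGauge().GetValue())
		}
	}
}
```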
re:

> sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s
The current state of #895 is to have a configurable port and frequency — we can fetch as slowly as we need to. For the ext-metrics datasources, we already query every 15s (or maybe even more frequently?). Once a secondary sql-exporter is added with just the cheap metrics, we can e.g. add support for gradually rolling out faster fetching from a different port, eventually switching everything over once old VMs restart.
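To illustrate the shape of that config (not the actual #895 code; the field names and types here are invented):

```go
// Sketch of the per-VM metrics-source config described above. Names are
// invented for illustration and don't match #895 exactly.
package config

import "time"

// MetricsSourceConfig controls where and how often the autoscaler-agent
// scrapes LFC metrics from a VM.
type MetricsSourceConfig struct {
	// Port the exporter listens on inside the VM. Old VMs keep the main
	// sql-exporter's port; newer VMs could point this at a second exporter
	// that serves only the cheap LFC metrics.
	Port uint16 `json:"port"`
	// ScrapeInterval stays large while we're hitting the main sql-exporter
	// (to avoid loading postgres) and can shrink once the dedicated
	// exporter is in place.
	ScrapeInterval time.Duration `json:"scrapeInterval"`
}
```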
@skyzh Can you share your opinion on options 2 vs 3?
If we want to have a second sql-exporter, I'm fine with either option 2 or 3. Otherwise, there needs to be a place to fetch these metrics from, and that's easier to do in vm-monitor.
...to be specific, I assume the autoscaler-agent will at some point scrape this data at a high frequency, and I don't want those SQL queries to be executed on every sql-exporter scrape.
Therefore, I'm proposing that the autoscaling metrics not go through the normal metrics sql-exporter.
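For context, the LFC values themselves come from a couple of cheap queries; the concern is that scraping the main sql-exporter frequently runs everything else it collects too. A sketch of what the dedicated queries might look like, assuming the neon extension exposes `neon_get_lfc_stats()` and `approximate_working_set_size()` — both names are assumptions about the extension's interface and should be verified against neondatabase/neon:

```go
// Illustrative only: assumes the neon extension exposes neon_get_lfc_stats()
// (key/value pairs such as "file_cache_hits" / "file_cache_misses") and an
// approximate_working_set_size(reset bool) estimator. Verify the real
// interface before relying on these names.
package lfcmetrics

import (
	"database/sql"

	_ "github.com/lib/pq" // postgres driver
)

// readLFCStats runs the (cheap) LFC queries and returns the raw values.
func readLFCStats(db *sql.DB) (map[string]int64, error) {
	stats := make(map[string]int64)

	// Hit/miss counters and friends, as key/value pairs.
	rows, err := db.Query(`SELECT lfc_key, lfc_value FROM neon_get_lfc_stats()`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	for rows.Next() {
		var (
			key   string
			value int64
		)
		if err := rows.Scan(&key, &value); err != nil {
			return nil, err
		}
		stats[key] = value
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}

	// Working set size estimate (false = don't reset the estimator).
	var wss int64
	if err := db.QueryRow(`SELECT approximate_working_set_size(false)`).Scan(&wss); err != nil {
		return nil, err
	}
	stats["approximate_working_set_size"] = wss
	return stats, nil
}
```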
Discussed briefly with @skyzh — tl;dr:
- Medium-term, we want to avoid having the autoscaler-agent pull LFC metrics from the main sql-exporter
- Short-term:
- We can have the autoscaler-agent pull metrics from the existing sql-exporter, just with a low frequency so we don't overload postgres
- We can set up a second sql-exporter to just report LFC metrics
- Then, we can have the control plane set an annotation on new VMs telling the autoscaler-agent to fetch LFC metrics at a higher frequency from the new port — giving the desired end state while retaining support for older VMs (sketched below).
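A sketch of that selection logic, with the annotation key, ports, and intervals all invented for illustration:

```go
// Sketch of the annotation-based rollout described above. The annotation key,
// ports, and intervals are invented, not the real control plane contract.
package agent

import (
	"strconv"
	"time"
)

// Hypothetical annotation the control plane would set on new VMs.
const lfcPortAnnotation = "autoscaling.neon.tech/lfc-metrics-port"

type scrapeTarget struct {
	Port     uint16
	Interval time.Duration
}

// chooseLFCTarget decides where and how fast to scrape LFC metrics for a VM.
// Annotated (new) VMs get the dedicated exporter at a higher frequency; older
// VMs fall back to the main sql-exporter, scraped slowly to avoid loading
// postgres, until they restart onto new images.
func chooseLFCTarget(annotations map[string]string) scrapeTarget {
	if raw, ok := annotations[lfcPortAnnotation]; ok {
		if port, err := strconv.ParseUint(raw, 10, 16); err == nil {
			return scrapeTarget{Port: uint16(port), Interval: 5 * time.Second}
		}
	}
	return scrapeTarget{Port: 9399, Interval: time.Minute} // placeholder port
}
```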
Status:
- #895 is ready to merge; it was just waiting to avoid interfering with a patch release
- We found out the new metrics weren't exposed. PR to fix is neondatabase/cloud#14245
- Remaining work after that is actually using the metrics (design + implementation of a new scaling algorithm, maybe?)