oxidecomputer / omicron

Omicron: Oxide control plane

Resource utilization metrics could report samples even when there are no changes

bnaecker opened this issue · comments

Nexus currently keeps track of provisioned resources: CPUs, memory, and disks. Those go into the utilization views in the web console:

[Screenshot, 2024-03-19: utilization view in the web console]

As you can see there, the data is sampled irregularly. In particular, Nexus generates samples only when there are changes to the data. That is reasonable from an efficiency perspective, since it only reports deltas. However, it makes querying, and producing the graph shown above, more painful. It's impossible to know, for example, if any particular time range contains any data. That leads the console implementation to do grubby things like finding the latest sample before and after the requested time range.

An alternative would be to report data even for intervals in which there are no changes. This would greatly simplify querying, graphing, and understanding the data, at the obvious expense of transferring more data. The size here is non-trivial, being linear in the number of distinct "virtual collections", which I believe means non-deleted projects, silos, and fleets.

One subtlety here is that we may still wish to report each change, not the total change in the sample period. For example, if a user provisions two new VMs in a sample period, do we report two samples, or one with the sum of the new provisioned resources? I'd expect we want the former, to avoid missing individual changes. So the reported samples would really be:

  • The last value, if there have been no changes
  • The value at each change, as a separate sample
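To make the rule above concrete, here is a minimal sketch as a pure function. The `Sample` type, its fields, and `samples_for_interval` are all made up for illustration and are not Nexus types. Re-stamping the carried-forward value with the end of the interval is also just one possible choice.

```rust
/// Hypothetical sample: a timestamp (in seconds) and a provisioned-CPU count.
/// This is not the real Nexus/oximeter sample type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Sample {
    timestamp: u64,
    cpus: i64,
}

/// Decide what to report for one sample interval.
///
/// `last` is the most recent value known before the interval, `changes` holds
/// one sample per change that happened during the interval, and `interval_end`
/// is the timestamp at which the interval closes.
fn samples_for_interval(last: Sample, changes: &[Sample], interval_end: u64) -> Vec<Sample> {
    if changes.is_empty() {
        // No changes: repeat the last value. Stamping it with the interval
        // end is an assumption of this sketch.
        vec![Sample { timestamp: interval_end, cpus: last.cpus }]
    } else {
        // Changes: report each one as its own sample, not a summed total.
        changes.to_vec()
    }
}
```

With this shape, every interval yields at least one sample, so a query over any time range that covers a full interval always finds data.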

The original implementation pre-dated RPWs, anyway, so this should be much easier to do nowadays. The periodic querying could definitely be done as a paginated walk over all "fleets / silos / projects", summing up each group in-memory and reporting the value. A paginated walk should avoid any full-table scans.
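A rough sketch of what such a walk could look like, with an in-memory slice standing in for the database table. Everything here is hypothetical: `Provision`, `fetch_page`, and `sum_by_collection` are illustration-only names, and the real query layer in Nexus will look different. The sketch assumes rows are ordered by a unique `id`, so each page can resume from the last id seen.

```rust
use std::collections::BTreeMap;

/// Hypothetical row: one provisioned resource charged to a named collection.
#[derive(Debug, Clone, PartialEq)]
struct Provision {
    id: u32,
    collection: String,
    cpus: i64,
}

/// Stand-in for a paginated database query: returns up to `limit` rows with
/// `id` greater than `marker`. Assumes `rows` is sorted by `id`.
fn fetch_page(rows: &[Provision], marker: Option<u32>, limit: usize) -> Vec<Provision> {
    rows.iter()
        .filter(|r| marker.map_or(true, |m| r.id > m))
        .take(limit)
        .cloned()
        .collect()
}

/// Walk all rows one page at a time, summing CPUs per collection in memory.
fn sum_by_collection(rows: &[Provision], page_size: usize) -> BTreeMap<String, i64> {
    let mut totals = BTreeMap::new();
    let mut marker = None;
    loop {
        let page = fetch_page(rows, marker, page_size);
        // An empty page means the walk is complete.
        let Some(last) = page.last() else { break };
        // Resume the next page after the last id seen in this one.
        marker = Some(last.id);
        for row in &page {
            *totals.entry(row.collection.clone()).or_insert(0) += row.cpus;
        }
    }
    totals
}
```

Only one page of rows is held at a time; the running totals are the only state that grows, and they grow with the number of collections rather than the number of rows.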

Yeah, that could work. I think we need to maintain the samples in-memory anyway until they're fetched, so another option is to use each new provision operation as an "event" causing us to store a sample. I.e., each time we call into nexus_db_queries::provisioning::Producer::append_all_samples(), we store each new sample. When they're collected, we keep the last one, and report it on the next sample interval if there have been no such events.
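Roughly, the event-driven variant could be sketched like this. `SampleBuffer`, `record`, and `collect` are invented names, and nothing here reflects the actual `Producer` internals; it only illustrates "store on each event, keep the last one at collection time, replay it when nothing happened".

```rust
/// Hypothetical sample: a timestamp (in seconds) and a provisioned-CPU count.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Sample {
    timestamp: u64,
    cpus: i64,
}

/// Hypothetical in-memory buffer between provision events and collection.
#[derive(Debug, Default)]
struct SampleBuffer {
    /// Samples stored since the previous collection, one per event.
    pending: Vec<Sample>,
    /// The newest sample handed out so far, kept for quiet intervals.
    last: Option<Sample>,
}

impl SampleBuffer {
    /// Called on each provision event.
    fn record(&mut self, sample: Sample) {
        self.pending.push(sample);
    }

    /// Called once per collection interval; `now` is the collection time.
    fn collect(&mut self, now: u64) -> Vec<Sample> {
        if let Some(&newest) = self.pending.last() {
            // There were events: remember the newest and hand out all of them.
            self.last = Some(newest);
            return std::mem::take(&mut self.pending);
        }
        // No events: replay the last value, re-stamped with `now` (an
        // assumption of this sketch). Before any event, report nothing.
        self.last
            .map(|s| Sample { timestamp: now, ..s })
            .into_iter()
            .collect()
    }
}
```

One thing this leaves open is what to report before the first event after a restart, since `last` starts out empty; seeding it from the database would be one way to handle that.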

I dunno what makes the most sense, we'll see when we get in there.