oxidecomputer / omicron

Omicron: Oxide control plane

Reconfigurator: Removing services should unregister them from oximeter

jgallagher opened this issue

The only service that currently registers itself as a metrics producer is Nexus, so at least for Nexus, we need to:

  1. Remove the corresponding row in metric_producer for the service.
  2. Notify the assigned Oximeter collector that it should stop polling the producer.

Expanding some details on these, from poking around and chatting with @bnaecker:

  1. metric_producer.id is the service ID when metric_producer.kind = 'service'. I believe the only consumer of this table is the path where an Oximeter collector posts itself to Nexus, and Nexus responds by notifying it of all the producers it should be polling.
  2. Instances do this today by calling Nexus::unassign_producer. Reconfigurator will need to do the same (which will require some refactoring to make that method usable from the blueprint execution background task).

We should probably update this in RFD 459 too.

@andrewjstone, @smklein, @jgallagher and I discussed the issue of deregistering an oximeter producer today in a video call.

The problem

The main issue brought up is how to deregister oximeter producers when we remove or move a service. Today, any services that want to produce metric data register with Nexus. Nexus finds an available oximeter-collector from a table in the DB; records the assignment of that collector to that producer in another table; and then notifies the oximeter-collector itself so that it can start collecting from the producer.

But what happens when a service wants to deregister? Or more problematically, what if a service simply goes away? Right now, the only time this happens is on instance shutdown, when Nexus undoes all the above (removes the records and tells oximeter-collector to stop). But in general, services may not deregister or may not be able to deregister, and we'd like deregistration to happen automatically in those cases.

Proposal

The proposal we all agreed on uses the idea of a lease. Upon registering with Nexus, the producer gets back a lease interval from Nexus. The timestamp of this registration is also recorded in Nexus. (I actually think it already is.) The producer will be required to re-register within some fixed fraction of that interval (say 1/4 or 1/10) in order for collection to continue.
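The lease arithmetic above can be sketched roughly as follows. This is only an illustration: the `Lease` type, its fields, and the `divisor` parameter are hypothetical names, not anything taken from Omicron.

```rust
use std::time::{Duration, Instant};

/// Hypothetical lease handed back by Nexus on registration.
#[derive(Debug, Clone, Copy)]
pub struct Lease {
    /// When Nexus recorded the registration (assumed field).
    pub registered_at: Instant,
    /// Lease interval returned to the producer (assumed field).
    pub duration: Duration,
}

impl Lease {
    /// Deadline by which the producer should re-register, expressed as a
    /// fixed fraction of the lease (divisor = 4 means 1/4 of the interval).
    /// Panics if `divisor` is zero.
    pub fn renew_by(&self, divisor: u32) -> Instant {
        self.registered_at + self.duration / divisor
    }

    /// True once the full lease interval has elapsed without renewal.
    pub fn is_expired(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.registered_at) >= self.duration
    }
}
```

With a 600-second lease and a divisor of 4, the producer would aim to re-register within 150 seconds, leaving plenty of slack before the lease actually expires.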

A new RPW in Nexus will be responsible for enforcing this re-lease interval. It will periodically scan the table for expired producers, and notify oximeter to stop collecting from them. Then the record will be hard deleted.
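One pass of such a scan might look like the sketch below. It models the table as an in-memory map purely for illustration; the real task would query the database. `ProducerRecord`, `prune_expired`, and the `u32` IDs (standing in for UUIDs) are all assumptions.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory stand-in for a row in the producer table.
pub struct ProducerRecord {
    pub last_registered: Instant,
    pub lease: Duration,
}

/// One pass of the expiry scan: collect the IDs whose lease has lapsed,
/// hard-delete them from the table, and return them so the caller can
/// notify the collector (notification is not modeled here).
pub fn prune_expired(
    table: &mut BTreeMap<u32, ProducerRecord>,
    now: Instant,
) -> Vec<u32> {
    let expired: Vec<u32> = table
        .iter()
        .filter(|(_, rec)| {
            now.saturating_duration_since(rec.last_registered) >= rec.lease
        })
        .map(|(id, _)| *id)
        .collect();
    for id in &expired {
        table.remove(id);
    }
    expired
}
```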

Separately, the oximeter_producer::Server type used to register with Nexus will internally spawn a task that will re-register with Nexus within its required interval. The producer should probably expose some information about this task, such as the time since the last successful registration and the time until it next registers. But the producer itself otherwise doesn't need to know anything about this process.
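The information the renewal task exposes could be as small as the following sketch. `RenewalStatus` and its methods are hypothetical and are not part of oximeter_producer; the task itself (spawning, retrying, talking to Nexus) is left out.

```rust
use std::time::{Duration, Instant};

/// Hypothetical snapshot of the background renewal task's state.
#[derive(Debug, Clone, Copy)]
pub struct RenewalStatus {
    last_success: Option<Instant>,
    next_attempt: Instant,
}

impl RenewalStatus {
    /// No registration has succeeded yet; the first attempt is scheduled.
    pub fn new(first_attempt: Instant) -> Self {
        Self { last_success: None, next_attempt: first_attempt }
    }

    /// Record a successful registration and schedule the next one at a
    /// fixed fraction of the lease returned by Nexus.
    pub fn record_success(&mut self, now: Instant, lease: Duration, divisor: u32) {
        self.last_success = Some(now);
        self.next_attempt = now + lease / divisor;
    }

    /// Time since the last successful registration, if there was one.
    pub fn since_last_success(&self, now: Instant) -> Option<Duration> {
        self.last_success.map(|t| now.saturating_duration_since(t))
    }

    /// Time remaining until the next scheduled registration (zero if due).
    pub fn until_next_attempt(&self, now: Instant) -> Duration {
        self.next_attempt.saturating_duration_since(now)
    }
}
```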

Other notes

We also discussed the idea that oximeter could be notified of the expiration interval itself, and stop collecting from the producer after that time. This would catch the case where Nexus failed to notify it (the RPW failed, or Nexus couldn't reach oximeter), and the producer would still be deregistered. This might be useful to do, but we opted to defer it. The failure mode is pretty minor -- either we get more data than we thought (if the producer and collector can talk, but Nexus can't talk to oximeter), or oximeter keeps trying for too long (if Nexus can't talk to it). Both seem relatively benign, and we can add this later if we need it.

Wanted to make a note about rolling this out. We probably want to avoid actually deregistering producers that don't update their lease until we can update them to do so. E.g., we need to add the internal lease-update into the oximeter_producer::Server; update everyone's dep on oximeter-producer past that; and then roll out the lease-enforcing RPW in Nexus.

Proposal Tweak

We realized that this RPW plan:

A new RPW in Nexus will be responsible for enforcing this re-lease interval. It will periodically scan the table for expired producers, and notify oximeter to stop collecting from them. Then the record will be hard deleted.

still leaves open a race window if we get this (unlikely) sequence:

  1. The RPW finds an expired producer
  2. The RPW successfully notifies oximeter that the producer's lease has expired
  3. oximeter crashes and restarts
  4. oximeter sends a request to Nexus to get its producers. Nexus responds with the contents of the database table, which still include the producer that expired in step 1.
  5. The RPW successfully deletes the expired producer record from the DB

At the end of this, the RPW believes it has done its job, but oximeter will still have the expired producer.

Instead, we propose the following:

  • oximeter will periodically send a request to Nexus to refresh its list of producers. This should be more or less equivalent to the request it does when it's starting up, except it may need to remove producers that are no longer present.
  • The RPW no longer needs to notify oximeter at all; its only requirement is cleaning expired producer records out of the database.
  • As a happy path optimization, the RPW can make an attempt to notify oximeter when it's removing an expired producer. If this is successful, it means oximeter's producer list gets updated more quickly; if it's unsuccessful, oximeter will eventually find out the new list anyway (the next time it re-requests its producer list).
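The periodic refresh amounts to reconciling the collector's current producer set against whatever Nexus returns. A minimal sketch, assuming producers are identified by plain integer IDs (real IDs would be UUIDs) and using hypothetical names throughout:

```rust
use std::collections::BTreeSet;

/// Hypothetical result of comparing the collector's producers with
/// the list Nexus returned.
#[derive(Debug, PartialEq)]
pub struct Reconciliation {
    /// Producers Nexus knows about that the collector is not yet polling.
    pub to_add: Vec<u32>,
    /// Producers the collector is polling that Nexus no longer lists.
    pub to_remove: Vec<u32>,
}

/// Compute which producers to start and stop collecting from.
pub fn reconcile(
    current: &BTreeSet<u32>,
    from_nexus: &BTreeSet<u32>,
) -> Reconciliation {
    Reconciliation {
        to_add: from_nexus.difference(current).copied().collect(),
        to_remove: current.difference(from_nexus).copied().collect(),
    }
}
```

Because the database contents are treated as the source of truth on every refresh, a missed or reordered notification (as in the race above) is corrected the next time the collector asks.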

I'm working on the producer-renewal side of things, and I think it presents a few challenges in terms of the update path. Originally, the Nexus internal API for producer registration returned a 204, which has no content. To support the automatic renewal, we'd like to return the lease duration, and so I want to change the response code to a 201.

Progenitor automatically handles responses that it does not recognize as part of the API description. Specifically, it stuffs them into an Error::UnexpectedResponse:

https://github.com/oxidecomputer/progenitor/blob/4a3dfec3926f1f9db78eb6dc90087a1e2a1f9e45/progenitor-impl/src/method.rs#L1126-L1137

The current implementation of the registration method does not handle this case particularly well. It will always return an error here, and mark it as non-retryable:

    client.cpapi_producers_post(&server_info.into()).await.map(|_| ()).map_err(
        |err| {
            let retryable = match &err {
                nexus_client::Error::CommunicationError(..) => true,
                nexus_client::Error::ErrorResponse(resp) => {
                    resp.status().is_server_error()
                }
                _ => false,
            };
            let msg = err.to_string();
            Error::RegistrationError { retryable, msg }
        },
    )
}

For consumers inside of Omicron, this is not a huge deal. But for things like Propolis, which are out of tree, updating the Nexus OpenAPI description will mean there is a period where the Nexus server returns a successful response that the producer will interpret as a fatal error.

To handle this, I think we can make one small preparatory PR to Omicron, and then filter it through the out-of-tree consumers. We can have the current registration function check for UnexpectedResponse, specifically looking for any successful response (even though it didn't recognize it). So the consumer might get a 201 (with some data), but would not fail. Any 2xx response would mean that the producer registered successfully.
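The preparatory change could be sketched along these lines. To keep the example self-contained, `ClientError` is a simplified stand-in for the generated client's error type, with bare `u16` status codes instead of real response objects; `interpret` and `RegistrationError` as written here are likewise hypothetical.

```rust
/// Simplified stand-in for the generated client's error type; only the
/// variants relevant to this sketch are modeled.
#[derive(Debug)]
pub enum ClientError {
    CommunicationError(String),
    ErrorResponse { status: u16 },
    UnexpectedResponse { status: u16 },
}

/// Hypothetical error surfaced to the producer.
#[derive(Debug, PartialEq)]
pub struct RegistrationError {
    pub retryable: bool,
    pub msg: String,
}

/// Map the client's result to a registration outcome, treating any
/// unrecognized 2xx response as success rather than a fatal error.
pub fn interpret(
    result: Result<(), ClientError>,
) -> Result<(), RegistrationError> {
    match result {
        Ok(()) => Ok(()),
        // The client did not recognize the status, but the server
        // reported success, so registration went through.
        Err(ClientError::UnexpectedResponse { status })
            if (200..300).contains(&status) =>
        {
            Ok(())
        }
        Err(err) => {
            let retryable = match &err {
                ClientError::CommunicationError(..) => true,
                ClientError::ErrorResponse { status } => {
                    (500..600).contains(status)
                }
                _ => false,
            };
            Err(RegistrationError { retryable, msg: format!("{err:?}") })
        }
    }
}
```

An older client that sees a 201 it does not know about would then report success, while unexpected non-2xx responses remain non-retryable errors as before.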

Then we can update the Nexus API to return a 201, meaning those clients above continue to work correctly. And then we can make the bigger change to update the producer to renew its lease.

@jgallagher This sort of lost steam in the push around release 8. I think that all the out-of-tree consumers of the registration API have been updated, and so everyone should now automatically re-register in the background. I think we're safe to flip the switch on the GC background task, if you want to go ahead and do that.