cloudfoundry / diego-release

BOSH Release for Diego

[BBS] Add request metrics for BBS endpoints

klapkov opened this issue · comments

Add request metrics for BBS endpoints

Summary

Currently, BBS emits little information about the performance of its endpoints. The only request metrics we emit today are RequestsCount and RequestLatency, and both are aggregated across all BBS endpoints. That is why we propose a more detailed look under the hood of the BBS server. We can achieve this with the help of a module already used in the rep and locket:
https://github.com/cloudfoundry/locket/blob/main/metrics/helpers/request_metrics.go

With this helper we get much more information about performance per endpoint. It can be implemented at the handler level and emits new metrics once per minute (the default report interval). It gives us these metrics:

  • RequestsStarted
  • RequestsSucceeded
  • RequestsFailed
  • RequestsInFlight
  • RequestsCancelled
  • RequestLatencyMax

Now the tricky question is which endpoints should implement it. Here are most of the BBS endpoints:

// desiredLRP endpoints
"DesiredLRPSchedulingInfos", "DesiredLRPRoutingInfos", "DesiredLRPByProcessGuid", "DesiredLRPs",

// desiredLRP lifecycle endpoints
"UpdateDesireLRP", "RemoveDesiredLRP", "DesireDesiredLRP",

// actualLRP endpoints
"ActualLRPs",

// actualLRP lifecycle endpoints
"ClaimActualLRP", "StartActualLRP", "CrashActualLRP", "FailActualLRP", "RemoveActualLRP", "RetireActualLRP",

// evacuation endpoints
"RemoveEvacuatingActualLRP", "EvacuateClaimedActualLRP", "EvacuateCrashedActualLRP", "EvacuateStoppedActualLRP", "EvacuateRunningActualLRP",

// task endpoints
"Tasks", "TaskByGuid", "DesireTask", "StartTask", "CancelTask", "RejectTask", "CompleteTask", "ResolvingTask", "DeleteTask",

Suppose we implement the helper for every one of these endpoints, which would give us full visibility into all operations of the BBS server. The list above has 28 endpoints; multiplied by 6 metrics each, that is 168 new metrics. That is a lot.

If we do not want to introduce this many new metrics, we can divide the endpoints into groups, for example:

  • "DesiredLRPEndpoints"
  • "DesiredLRPLifecycleEndpoints"
  • "ActualLRPSEndpoint"
  • "ActualLRPLifecycleEndpoints"
  • "EvacuationEndpoints"
  • "TaskEndpoints"

Maybe StartActualLRP should be in a group of its own, since it is called periodically.

In this case we have 36 new metrics (6 groups × 6 metrics). With this approach we do not get quite as much information, but at least we know how a given operation group performs. The groups above are only an example; if we go down this path, we need to decide how to split them.
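
The two counts can be checked with a short sketch. It assumes each example group contains exactly the endpoints listed under the matching comment in the endpoint list above, and that a metric name is formed as `<prefix>.<metric>`; both the mapping and the naming scheme are illustrative assumptions, not something BBS or the helper defines.

```go
package main

import "fmt"

// metricNames are the six values the helper reports, as listed in this issue.
var metricNames = []string{
	"RequestsStarted", "RequestsSucceeded", "RequestsFailed",
	"RequestsInFlight", "RequestsCancelled", "RequestLatencyMax",
}

// group pairs an example group name with its endpoints (assumed mapping).
type group struct {
	name      string
	endpoints []string
}

var groups = []group{
	{"DesiredLRPEndpoints", []string{"DesiredLRPSchedulingInfos", "DesiredLRPRoutingInfos", "DesiredLRPByProcessGuid", "DesiredLRPs"}},
	{"DesiredLRPLifecycleEndpoints", []string{"UpdateDesireLRP", "RemoveDesiredLRP", "DesireDesiredLRP"}},
	{"ActualLRPSEndpoint", []string{"ActualLRPs"}},
	{"ActualLRPLifecycleEndpoints", []string{"ClaimActualLRP", "StartActualLRP", "CrashActualLRP", "FailActualLRP", "RemoveActualLRP", "RetireActualLRP"}},
	{"EvacuationEndpoints", []string{"RemoveEvacuatingActualLRP", "EvacuateClaimedActualLRP", "EvacuateCrashedActualLRP", "EvacuateStoppedActualLRP", "EvacuateRunningActualLRP"}},
	{"TaskEndpoints", []string{"Tasks", "TaskByGuid", "DesireTask", "StartTask", "CancelTask", "RejectTask", "CompleteTask", "ResolvingTask", "DeleteTask"}},
}

// allEndpoints flattens the groups into a single endpoint list.
func allEndpoints() []string {
	var out []string
	for _, g := range groups {
		out = append(out, g.endpoints...)
	}
	return out
}

// groupNames returns the group names in declaration order.
func groupNames() []string {
	out := make([]string, 0, len(groups))
	for _, g := range groups {
		out = append(out, g.name)
	}
	return out
}

// groupFor returns the group an endpoint would be reported under.
func groupFor(endpoint string) (string, bool) {
	for _, g := range groups {
		for _, e := range g.endpoints {
			if e == endpoint {
				return g.name, true
			}
		}
	}
	return "", false
}

// metricsFor builds one metric name per (prefix, metric) pair.
func metricsFor(prefixes []string) []string {
	out := make([]string, 0, len(prefixes)*len(metricNames))
	for _, p := range prefixes {
		for _, m := range metricNames {
			out = append(out, p+"."+m)
		}
	}
	return out
}

func main() {
	fmt.Println(len(allEndpoints()), len(metricsFor(allEndpoints()))) // 28 168
	fmt.Println(len(groupNames()), len(metricsFor(groupNames())))     // 6 36
	g, _ := groupFor("StartActualLRP")
	fmt.Println(g) // ActualLRPLifecycleEndpoints
}
```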

Maybe we can even make the set of endpoints that use the helper configurable, so that everyone can choose what suits them best.
Nevertheless, I think this topic is worth a discussion. I will come back with a PoC of sorts in the next few days.

Diego repo

https://github.com/cloudfoundry/bbs