100% latency increase on routing/v1 API

Question

guseggert opened this issue 2 years ago

Something I noticed this morning from Hydra metrics: at around 2022-12-16T10:36:00Z, the cid.contact/routing/v1 endpoint experienced a sudden 100% increase in latency; p50 went from ~30ms to ~60ms. This doesn't correlate with any increase in request rate to the Hydras or with other client-side metrics, so it seems likely to be a server-side issue. Perhaps there was a deployment around that time?
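For reference, the "100%" figure is the relative change in p50 between the windows before and after the jump. A minimal sketch of that arithmetic follows; the sample values are made up for illustration (they are not the actual Hydra metrics) and `relative_change_pct` is a hypothetical helper name.

```python
from statistics import median

def relative_change_pct(before_ms, after_ms):
    """Percent change in median (p50) latency between two sample windows."""
    p50_before = median(before_ms)
    p50_after = median(after_ms)
    return (p50_after - p50_before) / p50_before * 100.0

# Illustrative samples only, chosen so the medians are 30 ms and 60 ms.
before = [28, 29, 30, 31, 35]
after = [55, 58, 60, 63, 70]
change = relative_change_pct(before, after)  # a doubling of p50 is a 100% increase
```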

Graph:

(source)

Masih H. Derkani · Answer 1 · Fri Dec 16 2022 21:59:42 GMT+0800 (China Standard Time)

Assuming times are in UTC?

Gus Eggert · Answer 2 · Fri Dec 16 2022 22:01:09 GMT+0800 (China Standard Time)

The UTC time is in the message (2022-12-16T10:36:00Z), the image is in my local tz (UTC-5). Sorry for that confusion.

Masih H. Derkani · Answer 3 · Fri Dec 16 2022 22:01:31 GMT+0800 (China Standard Time)

Thanks Gus; looking

Gus Eggert · Answer 4 · Fri Dec 16 2022 22:03:40 GMT+0800 (China Standard Time)

Fixed graph to use UTC tz

Masih H. Derkani · Answer 5 · Fri Dec 16 2022 22:18:11 GMT+0800 (China Standard Time)

Origin latency at CloudFront and upstream latency at the nginx ingress controller both look flat. There have not been any deployments today.

This leads me to believe that either there is something we are not measuring at cid.contact, or the root cause is outside of cid.contact. I am trying to think of an alternative source through which we could confirm the latency increase. For example, do gateways observe the same latency increase (even though they are not using the new HTTP delegated routing endpoint)? Under the hood, the requests are translated and hit the same execution path down the line.
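One possible independent check would be a small client-side probe run from a machine that is not a Hydra. The sketch below is only an illustration under assumptions: `http_get` and `probe_p50_ms` are hypothetical helpers, and the URL in the trailing comment (including the `/providers/<cid>` path) is a placeholder rather than something confirmed in this thread.

```python
import time
import urllib.request
from statistics import median

def http_get(url, timeout=10.0):
    """Fetch a URL and discard the body; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()

def probe_p50_ms(fetch, samples=20):
    """Call fetch() `samples` times and return the median wall-clock latency in ms."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fetch()
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)

# Example usage (not executed here; the path and CID are placeholders):
# url = "https://cid.contact/routing/v1/providers/<cid>"
# print(probe_p50_ms(lambda: http_get(url)))
```

A probe like this measures end-to-end time from the client, so it includes network and TLS overhead that the CloudFront and nginx metrics would not show. Comparing it against those server-side numbers could help narrow down where the extra ~30ms is spent.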

Gus Eggert · Answer 6 · Fri Dec 16 2022 23:42:01 GMT+0800 (China Standard Time)

It's certainly possible that this is caused by something internal to the Hydras that makes them write the request and read the response more slowly, but those are usually gradual degradations or align with some other metric like request rate.

Also, I do not see this same spike from the prod Hydras, which are still running reframe.
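The distinction drawn above between a sudden jump and a gradual degradation can be checked mechanically on a series of p50 data points. This is a rough sketch, not how the Hydra dashboards work: `largest_step` is a hypothetical helper, the window size is arbitrary, and both series are synthetic.

```python
from statistics import mean

def largest_step(series, window=5):
    """Return (index, ratio) for the largest jump in mean between adjacent windows.

    Assumes all values are positive (latencies), so the ratio is well defined.
    """
    best_index, best_ratio = None, 1.0
    for i in range(window, len(series) - window + 1):
        before = mean(series[i - window:i])
        after = mean(series[i:i + window])
        ratio = after / before
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    return best_index, best_ratio

# Synthetic p50 series in ms: an abrupt doubling versus a slow drift.
step = [30] * 10 + [60] * 10
drift = [30 + 1.5 * i for i in range(20)]

step_index, step_ratio = largest_step(step)
drift_index, drift_ratio = largest_step(drift)
```

On the abrupt series the ratio peaks at 2.0 exactly where the level changes, while on the drifting series no pair of adjacent windows differs by more than about 23%, even though the drift ends up almost as high.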

Andrew Gillis · Answer 7 · Thu Sep 28 2023 01:25:14 GMT+0800 (China Standard Time)

Cannot reproduce (no more hydras), and no follow-up on this specific issue.