Add alerts for current leading indicator of slow ingest

Question

masih opened this issue 10 months ago

Masih H. Derkani commented 10 months ago

Add alerts, integrated into Slack and OpsGenie, that trigger when the ingest rate slows down and the provider lag grows. We already have an alert for the ingest rate stopping for more than an hour, but it does not catch slowdowns or gaps in ingest.

We should look at existing alternative leading indicators to alert on this. Namely:

ProbeLab providers, which check lookup success for CIDs within 5 minutes of their publication
Lag value reported for providers at the /provider backend. In both recent incidents, the NFT.Storage lag on /provider backends grew consistently. The lag for this particular provider should typically remain below 20.
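
A minimal sketch of the lag check described above, in Go. The threshold of 20 comes from this issue; the function name, the window of consecutive samples, and the sample readings are assumptions made for illustration. How lag values are actually scraped from the backends is out of scope here.

```go
package main

import "fmt"

// lagThreshold is the "should typically remain below 20" figure from the issue.
const lagThreshold = 20

// sustainedLag reports whether the most recent `window` samples are all at or
// above threshold. Requiring several consecutive samples is an assumption made
// here so that a single spike does not page anyone.
func sustainedLag(samples []int, threshold, window int) bool {
	if window <= 0 || len(samples) < window {
		return false
	}
	for _, lag := range samples[len(samples)-window:] {
		if lag < threshold {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical lag readings for one provider, oldest first.
	readings := []int{3, 5, 22, 31, 47}
	if sustainedLag(readings, lagThreshold, 3) {
		fmt.Println("ALERT: provider lag has stayed at or above threshold")
	}
}
```

In practice the same condition would more likely be expressed as a rule in whatever alerting system already feeds Slack and OpsGenie; the code only shows the logic.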

Andrew Gillis · Answer 1 · Thu Nov 16 2023 05:14:12 GMT+0800 (China Standard Time)

Added alerts based on metrics collected by the telemetry service. The ProbeLab data probably does not apply anymore.

The telemetry service can poll the head advertisement from NFT.Storage, get some multihashes from it, and then look up those multihashes. An alert can be generated if the multihashes cannot be looked up after some amount of time. Alternatively, the NFT.Storage provider distance can be tracked, and an alert generated if the distance grows too large.
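
The lookup-based check could be sketched roughly as follows. This is not the telemetry service's actual code: `lookupFunc`, `pending`, `defaultGrace`, and the placeholder multihash strings are all hypothetical. Fetching the head advertisement and querying the indexer are left behind the `lookupFunc` hook.

```go
package main

import (
	"fmt"
	"time"
)

// defaultGrace is an assumed deadline; the issue only says "some amount of time".
const defaultGrace = 5 * time.Minute

// lookupFunc is a hypothetical hook that reports whether a multihash can
// currently be found in the indexer.
type lookupFunc func(mh string) bool

// pending maps each sampled multihash to the time it was first seen in a
// head advertisement.
type pending map[string]time.Time

// add records newly sampled multihashes, keeping the original timestamp for
// any that are already tracked.
func (p pending) add(mhs []string, now time.Time) {
	for _, mh := range mhs {
		if _, ok := p[mh]; !ok {
			p[mh] = now
		}
	}
}

// check looks up every tracked multihash. Found ones are dropped; ones still
// missing for longer than grace are returned so the caller can raise an alert.
func (p pending) check(lookup lookupFunc, now time.Time, grace time.Duration) []string {
	var overdue []string
	for mh, seen := range p {
		if lookup(mh) {
			delete(p, mh) // deleting during range is safe in Go
			continue
		}
		if now.Sub(seen) > grace {
			overdue = append(overdue, mh)
		}
	}
	return overdue
}

func main() {
	start := time.Now()
	p := pending{}
	p.add([]string{"mh-a", "mh-b"}, start) // placeholder multihashes

	// Stand-in for the indexer: only mh-a has been ingested.
	indexed := map[string]bool{"mh-a": true}
	lookup := func(mh string) bool { return indexed[mh] }

	overdue := p.check(lookup, start.Add(2*defaultGrace), defaultGrace)
	if len(overdue) > 0 {
		fmt.Println("ALERT: multihashes not indexed within grace period:", overdue)
	}
}
```

Keeping the first-seen timestamp means a multihash that stays missing across several polls is measured from when it was first sampled, not from the latest poll.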