ipni / storetheindex

A directory of CIDs

Add alerts for current leading indicator of slow ingest

masih opened this issue

Add alerts, integrated into Slack and OpsGenie, that trigger when the ingest rate slows down and the provider lag grows. We already have an alert for the ingest rate stopping for more than an hour, but it is not catching the ingest slowdowns seen in recent issues.

We should look at existing alternative leading indicators to alert on. Namely:

  • Probelab providers, which check whether CIDs can be looked up successfully within 5 minutes of their publication.
  • The lag value reported for providers at the /provider backends. In both recent incidents, the NFT.Storage lag on the /provider backends grew consistently. The lag for this particular provider should typically remain below 20.
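
The lag-based indicator could be evaluated roughly as follows. This is a minimal sketch, not the project's alerting rule: it assumes lag values for one provider have already been collected into a list (how they are fetched is left out), and the function name `lag_alert` and the `min_consecutive` parameter are hypothetical. Only the threshold of 20 comes from the issue text.

```python
from typing import Sequence

# From the issue: NFT.Storage lag should typically remain below 20.
LAG_THRESHOLD = 20


def lag_alert(
    samples: Sequence[int],
    threshold: int = LAG_THRESHOLD,
    min_consecutive: int = 3,  # hypothetical: how many samples must agree
) -> bool:
    """Return True when the most recent `min_consecutive` lag samples are all
    above `threshold` and never decrease from one sample to the next."""
    if len(samples) < min_consecutive:
        return False  # not enough data to judge a trend
    recent = list(samples[-min_consecutive:])
    above = all(s > threshold for s in recent)
    growing = all(b >= a for a, b in zip(recent, recent[1:]))
    return above and growing
```

Requiring several consecutive samples avoids paging on a single spike, while still firing well before ingest has been stopped for a full hour.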

Added alerts based on metrics collected by the telemetry service. The Probelab data probably does not apply anymore.

The telemetry service can poll the head advertisement from NFT.Storage, get some multihashes from it, and then look up those multihashes. An alert can be generated if the multihashes cannot be looked up after some amount of time. Alternatively, the NFT.Storage provider distance can be tracked, and an alert generated if the distance grows too large.
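
The multihash lookup check described above might look like the sketch below. It is an illustration under stated assumptions, not the telemetry service's code: `unresolved_after_deadline` is a hypothetical name, and the `lookup` callable stands in for whatever client actually queries the indexer. Fetching the head advertisement and extracting multihashes from it is assumed to happen elsewhere.

```python
import time
from typing import Callable, Iterable, List


def unresolved_after_deadline(
    multihashes: Iterable[str],
    lookup: Callable[[str], bool],  # hypothetical: True if the indexer finds it
    deadline_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll `lookup` until every multihash resolves or `deadline_s` elapses.

    Returns the multihashes still not found; a non-empty result is the
    condition on which an alert would be raised."""
    pending = list(multihashes)
    start = clock()
    while pending:
        # Keep only the multihashes that still cannot be looked up.
        pending = [mh for mh in pending if not lookup(mh)]
        if not pending or clock() - start >= deadline_s:
            break
        sleep(poll_interval_s)
    return pending
```

The clock and sleep functions are injected so that the deadline logic can be exercised without waiting in real time.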