Deploy / monitoring spec
mo4islona opened this issue · comments
Ingester
1. Main container
Service metrics
sqd_ingester_block_height Gauge
Last processed block. Trigger an alert if it has not changed for 15 min
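One way to express the stall condition in PromQL, as a sketch: `changes()` counts how many times a gauge value changed inside the window, so zero changes over 15 minutes means the height is stuck. This does not fire when the series disappears entirely; that case would need a separate `absent()` check.

```promql
# Fires when the ingester block height has not changed in the last 15 minutes
changes(sqd_ingester_block_height[15m]) == 0
```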
Metrics from cluster
Disk usage
Should alert if disk usage > 70%
sum(kubelet_volume_stats_used_bytes{ namespace="$archive"}) by (persistentvolumeclaim) / sum(kubelet_volume_stats_capacity_bytes{ namespace="$archive"}) by (persistentvolumeclaim)
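The query above returns a ratio between 0 and 1 per volume claim, so the 70% threshold can be attached directly, roughly as follows. Note that `$archive` is a dashboard variable; in a Prometheus alerting rule it would have to be replaced with a concrete namespace.

```promql
# Used / total bytes per PVC; fires when a volume is more than 70% full
(
  sum(kubelet_volume_stats_used_bytes{namespace="$archive"}) by (persistentvolumeclaim)
  /
  sum(kubelet_volume_stats_capacity_bytes{namespace="$archive"}) by (persistentvolumeclaim)
) > 0.7
```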
Restarts count
Should alert if the ingester keeps restarting for more than 30 min
sum by (pod) (increase(kube_pod_container_status_restarts_total{namespace="$archive"}[$__rate_interval]))
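For alerting, `$__rate_interval` (a Grafana dashboard variable) would need a fixed window. A minimal sketch, assuming "restarting for more than 30 min" is read as "at least one restart within the last 30 minutes"; both the window and the threshold are assumptions that may need tuning:

```promql
# Fires for any pod that restarted at least once in the last 30 minutes
sum by (pod) (
  increase(kube_pod_container_status_restarts_total{namespace="$archive"}[30m])
) > 0
```

A stricter reading (restarts that persist over time) could use a shorter window here together with a `for: 30m` clause in the alerting rule.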
2. Sidecar
mc
service that will write files to S3
Restarts count
Should alert if the sidecar keeps restarting for more than 30 min
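The restart query used for the main container can be narrowed to the sidecar through the `container` label that kube-state-metrics attaches to this metric. In this sketch the container name `mc` is an assumption taken from the sidecar name above, and the 30-minute window is illustrative:

```promql
# Restarts of the sidecar container only (container name "mc" is assumed)
sum by (pod) (
  increase(kube_pod_container_status_restarts_total{namespace="$archive", container="mc"}[30m])
) > 0
```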
Worker
1. Main container
Service metrics
sqd_worker_parquet_height Gauge
Last block in parquet file. Trigger an alert if it has not changed for 15 min
sqd_worker_db_height Gauge
Last block in RocksDB. Trigger an alert if it has not changed for 15 min
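Both worker gauges can be checked with the same stall pattern, sketched here as two separate expressions so that each alert names the component that stopped advancing:

```promql
# Parquet writer is stuck: no height change in 15 minutes
changes(sqd_worker_parquet_height[15m]) == 0

# RocksDB is stuck: no height change in 15 minutes
changes(sqd_worker_db_height[15m]) == 0
```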
Metrics from cluster
Disk usage
Should alert if disk usage > 70%
Restarts count
Should alert if the worker keeps restarting for more than 30 min
Latency
Should alert if worker latency exceeds 10s
2. Sidecar
mc
service that will download files from S3
Restarts count
Should alert if the sidecar keeps restarting for more than 30 min