Deploy / monitoring spec

Question

mo4islona opened this issue 2 years ago · comments

Eugene Formanenko commented 2 years ago

Ingester

1. Main container

Service metrics

sqd_ingester_block_height Gauge
Last processed block. Trigger an alert if it has not changed for 15 min
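
A minimal sketch of the stall condition, assuming the gauge is scraped by Prometheus under the name given above:

```promql
# Returns the series only when the block height did not change
# at all during the last 15 minutes.
changes(sqd_ingester_block_height[15m]) == 0
```

This returns nothing if the series is missing entirely, so a separate `absent(sqd_ingester_block_height)` check may be worth adding.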

Metrics from cluster

Disk usage
Should alert if disk usage > 70%

sum(kubelet_volume_stats_used_bytes{ namespace="$archive"}) by (persistentvolumeclaim) / sum(kubelet_volume_stats_capacity_bytes{ namespace="$archive"}) by (persistentvolumeclaim)
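
As an alert condition, the same ratio can be compared against the 70% threshold directly. A sketch, keeping the Grafana variable `$archive` from the query above (outside a dashboard it would need to be replaced by a literal namespace):

```promql
# Fraction of each PVC in use; only claims above 70% are returned.
  sum by (persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="$archive"})
/
  sum by (persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="$archive"})
> 0.7
```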

Restarts count
Should alert if the ingester keeps restarting for more than 30 min

sum by (pod) (increase(kube_pod_container_status_restarts_total{namespace="$archive"}[$__rate_interval]))
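
One possible reading of "restarting for more than 30 min" is that restarts keep occurring throughout a 30-minute period. A sketch under that assumption; the 10-minute window is an illustrative choice, not part of the spec:

```promql
# Pods with at least one container restart in the last 10 minutes.
# In a Prometheus alerting rule, a 30m `for` duration would require
# this condition to hold continuously before the alert fires.
sum by (pod) (increase(kube_pod_container_status_restarts_total{namespace="$archive"}[10m])) > 0
```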

2. Sidecar

mc service that will write files to S3

Restarts count
Should alert if the sidecar keeps restarting for more than 30 min
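
To watch the sidecar separately from the main container, the restart counter can be filtered by its `container` label. A sketch, assuming the sidecar container is named `mc` (the actual container name is not given in this spec):

```promql
# Restarts of the sidecar container only.
# container="mc" is an assumed name; the 10m window is illustrative.
sum by (pod) (increase(kube_pod_container_status_restarts_total{namespace="$archive", container="mc"}[10m])) > 0
```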

Worker

1. Main container

Service metrics

sqd_worker_parquet_height Gauge
Last block in parquet file. Trigger an alert if it has not changed for 15 min

sqd_worker_db_height Gauge
Last block in RocksDB. Trigger an alert if it has not changed for 15 min
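
The two worker gauges can be checked for stalls with the same kind of expression. A sketch, assuming both are scraped under the names given above:

```promql
# Parquet height has not moved in the last 15 minutes.
changes(sqd_worker_parquet_height[15m]) == 0

# RocksDB height has not moved in the last 15 minutes.
changes(sqd_worker_db_height[15m]) == 0
```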

Metrics from cluster

Disk usage
Should alert if disk usage > 70%

Restarts count
Should alert if the worker keeps restarting for more than 30 min
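
If ingester and worker share a namespace, the cluster metrics can be narrowed to the worker with label matchers. A sketch; the `.*worker.*` patterns are hypothetical and depend on how the worker's PVCs and pods are actually named:

```promql
# Worker volumes above 70% usage (PVC name pattern is an assumption).
  sum by (persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="$archive", persistentvolumeclaim=~".*worker.*"})
/
  sum by (persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="$archive", persistentvolumeclaim=~".*worker.*"})
> 0.7

# Worker pods with a restart in the last 10 minutes (pod name pattern is an assumption).
sum by (pod) (increase(kube_pod_container_status_restarts_total{namespace="$archive", pod=~".*worker.*"}[10m])) > 0
```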

Latency
Should alert if worker latency is more than 10s
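
The spec does not name a latency metric. If the worker exposed a Prometheus histogram, the check might look like the sketch below; `sqd_worker_request_duration_seconds` is a hypothetical metric name, and the 99th percentile and 5-minute window are illustrative choices:

```promql
# 99th percentile latency over the last 5 minutes, in seconds.
# sqd_worker_request_duration_seconds_bucket is a hypothetical metric,
# not one defined in this spec.
histogram_quantile(
  0.99,
  sum by (le) (rate(sqd_worker_request_duration_seconds_bucket[5m]))
) > 10
```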

2. Sidecar

mc service that will download files from S3

Restarts count
Should alert if the sidecar keeps restarting for more than 30 min