Update Kafka monitoring dashboard and Telegraf config as needed
samhld opened this issue
Update dashboard to more closely reflect the Confluent best practices for monitoring Kafka. These can be found here: https://docs.confluent.io/current/kafka/monitoring.html
Note: Updating the dashboard will very likely require extending what the Telegraf configuration collects.
Cells not already in template --> metrics:
- Number of in-sync replicas --> `IsrShrinksPerSec` / `IsrExpandsPerSec`
- Producer metrics --> `kafka.producer:type=producer-metrics,client-id=([-.\w]+)`
  - Compression rate
  - Response rate
  - Request rate
  - Request latencies
  - Outgoing byte rate
  - IO wait time
  - Batch sizes
- Consumer metrics --> `kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+),partition=([-.\w]+)`
  - Records lag
  - Bytes consumed rate
  - Records consumed rate
  - Fetch rate
- `LeaderElectionRateAndTimeMs`
- `UncleanLeaderElectionsPerSec`
- Time to service requests --> `TotalTimeMs` for producers, fetch-consumers, fetch-followers, queue, etc.
- `PurgatorySize`
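
For discussion, here is a rough sketch of how the Telegraf side might pick up the metrics listed above using the `jolokia2_agent` input. It is not a tested config. The URLs, `name_prefix` values, and measurement names are placeholders I made up, and it assumes each JVM already runs a Jolokia agent. The producer and consumer MBeans live in the client JVMs rather than on the brokers, so they would likely need their own input block pointed at those processes.

```toml
# Sketch only: URLs, prefixes, and measurement names are hypothetical.
[[inputs.jolokia2_agent]]
  name_prefix = "kafka_"
  # Assumed: the broker JVM exposes a Jolokia agent at this address.
  urls = ["http://localhost:8778/jolokia"]

  # IsrShrinksPerSec / IsrExpandsPerSec (plus the other ReplicaManager metrics)
  [[inputs.jolokia2_agent.metric]]
    name         = "replica_manager"
    mbean        = "kafka.server:name=*,type=ReplicaManager"
    field_prefix = "$1."

  # LeaderElectionRateAndTimeMs, UncleanLeaderElectionsPerSec
  [[inputs.jolokia2_agent.metric]]
    name         = "controller_stats"
    mbean        = "kafka.controller:name=*,type=ControllerStats"
    field_prefix = "$1."

  # TotalTimeMs, tagged by request type (Produce, FetchConsumer, FetchFollower, ...)
  [[inputs.jolokia2_agent.metric]]
    name     = "request_total_time"
    mbean    = "kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics"
    tag_keys = ["request"]

  # PurgatorySize, tagged by delayed operation (Produce, Fetch, ...)
  [[inputs.jolokia2_agent.metric]]
    name     = "purgatory_size"
    mbean    = "kafka.server:delayedOperation=*,name=PurgatorySize,type=DelayedOperationPurgatory"
    tag_keys = ["delayedOperation"]

# Producer/consumer client metrics come from the client JVMs, not the broker.
[[inputs.jolokia2_agent]]
  name_prefix = "kafka_client_"
  # Assumed: the producer/consumer application exposes Jolokia here.
  urls = ["http://localhost:8779/jolokia"]

  [[inputs.jolokia2_agent.metric]]
    name     = "producer"
    mbean    = "kafka.producer:type=producer-metrics,client-id=*"
    tag_keys = ["client-id"]

  [[inputs.jolokia2_agent.metric]]
    name     = "consumer_fetch"
    mbean    = "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,topic=*,partition=*"
    tag_keys = ["client-id", "topic", "partition"]
```

Tagging by `client-id`, `topic`, and `partition` keeps one series per client and partition, which is what the lag and rate cells would group by. The trade-off is higher series cardinality on clusters with many partitions.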
Possibly add disk, CPU, memory, and network metrics to this dashboard? Alternatively, recommend using the system and JVM templates alongside this one, and perhaps the ZooKeeper template as well.
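
If the host metrics end up in this dashboard rather than being left to the system template, the extra Telegraf inputs would probably look something like the fragment below. This is a sketch with mostly default settings, and the `ignore_fs` list is only an illustrative choice.

```toml
# Host-level metrics from Telegraf's standard system inputs (sketch, mostly defaults).
[[inputs.cpu]]
  percpu   = true   # per-core usage
  totalcpu = true   # aggregate usage across all cores

[[inputs.mem]]

[[inputs.disk]]
  # Illustrative: skip pseudo filesystems so only real volumes are reported.
  ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]

[[inputs.net]]
```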
CC: @bednar