influxdata / community-templates

InfluxDB Community Templates: Quickly collect & analyze time series data from a range of sources: Kubernetes, MySQL, Postgres, AWS, Nginx, Jenkins, and more.

Home Page: https://www.influxdata.com/products/influxdb-templates/gallery/

Update Kafka monitoring dashboard and Telegraf config as needed

samhld opened this issue

Update the dashboard to more closely reflect Confluent's best practices for monitoring Kafka. These can be found here: https://docs.confluent.io/current/kafka/monitoring.html

Note: Updating the dashboard will very likely require extending the set of metrics the Telegraf configuration collects.

Cells not already in the template --> metrics that back them:

  • Number of in-sync replicas --> IsrShrinksPerSec/IsrExpandsPerSec

  • Producer metrics --> kafka.producer:type=producer-metrics,client-id=([-.\w]+)

    • Compression rate
    • Response rate
    • Request rate
    • Request latencies
    • Outgoing byte rate
    • IO wait time
    • Batch sizes
  • Consumer metrics --> kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+),partition=([-.\w]+)

    • Records lag
    • Bytes consumed rate
    • Records consumed rate
    • Fetch rate
  • LeaderElectionRateAndTimeMs

  • UncleanLeaderElectionsPerSec

  • Time to service requests --> TotalTimeMs for producers, fetch-consumers, fetch-followers, queue, etc.

  • PurgatorySize
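As a starting point for discussion, the metrics above could be collected with Telegraf's `jolokia2_agent` input along these lines. This is a rough sketch, not the template's actual configuration: the Jolokia URLs, ports, `name_prefix`, and measurement names are assumptions, and the producer attribute names follow the Kafka client metric names, which may differ between Kafka versions. Producer and consumer MBeans live in the client JVMs rather than the brokers, so they need a Jolokia agent attached to those applications.

```toml
# --- Broker-side metrics (Jolokia agent attached to each Kafka broker) ---
[[inputs.jolokia2_agent]]
  name_prefix = "kafka_"                      # assumed prefix
  urls = ["http://localhost:8778/jolokia"]    # assumed broker Jolokia endpoint

  # IsrShrinksPerSec / IsrExpandsPerSec (and the other ReplicaManager metrics)
  [[inputs.jolokia2_agent.metric]]
    name         = "replica_manager"
    mbean        = "kafka.server:name=*,type=ReplicaManager"
    field_prefix = "$1."

  # LeaderElectionRateAndTimeMs / UncleanLeaderElectionsPerSec
  [[inputs.jolokia2_agent.metric]]
    name         = "controller_stats"
    mbean        = "kafka.controller:name=*,type=ControllerStats"
    field_prefix = "$1."

  # TotalTimeMs per request type (Produce, FetchConsumer, FetchFollower, ...)
  [[inputs.jolokia2_agent.metric]]
    name     = "request_total_time"
    mbean    = "kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics"
    tag_keys = ["request"]

  # PurgatorySize per delayed operation (Produce, Fetch, ...)
  [[inputs.jolokia2_agent.metric]]
    name     = "purgatory"
    mbean    = "kafka.server:delayedOperation=*,name=PurgatorySize,type=DelayedOperationPurgatory"
    tag_keys = ["delayedOperation"]

# --- Client-side metrics (Jolokia agent attached to producer/consumer JVMs) ---
[[inputs.jolokia2_agent]]
  name_prefix = "kafka_"                      # assumed prefix
  urls = ["http://localhost:8779/jolokia"]    # assumed client Jolokia endpoint

  # Producer metrics, tagged by client-id
  [[inputs.jolokia2_agent.metric]]
    name     = "producer"
    mbean    = "kafka.producer:client-id=*,type=producer-metrics"
    paths    = [
      "compression-rate-avg",
      "response-rate",
      "request-rate",
      "request-latency-avg",
      "outgoing-byte-rate",
      "io-wait-time-ns-avg",
      "batch-size-avg",
    ]
    tag_keys = ["client-id"]

  # Consumer fetch metrics at partition level, tagged by client, topic and partition.
  # No `paths` given, so all attributes of the matching MBeans are collected.
  [[inputs.jolokia2_agent.metric]]
    name     = "consumer_fetch"
    mbean    = "kafka.consumer:client-id=*,partition=*,topic=*,type=consumer-fetch-manager-metrics"
    tag_keys = ["client-id", "topic", "partition"]
```

Keeping request type, delayed operation, client-id, topic, and partition as tags rather than folding them into field names should make it easier to group and filter by them in dashboard cells.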

Possibly add disk, CPU, memory, and network metrics to this dashboard? Alternatively, recommend using the system and JVM templates in conjunction with this one, and perhaps the ZooKeeper template as well.
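If host metrics end up in this template rather than being left to the system template, a minimal sketch of the extra Telegraf inputs might look like the following. The option values shown are illustrative assumptions, not settings taken from any existing template.

```toml
# Host-level metrics for each Kafka broker machine
[[inputs.cpu]]
  percpu   = false   # assumed: aggregate CPU is enough for a Kafka overview
  totalcpu = true

[[inputs.mem]]

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]   # assumed: skip in-memory filesystems

[[inputs.diskio]]

[[inputs.net]]
```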

CC: @bednar