pinterest / DoctorK

DoctorK is a service for Kafka cluster auto healing and workload balancing

Slow stats collection outside of AWS

BrianGallew opened this issue

All my Kafka brokers in AWS have no problem meeting the 30-second polling interval for kafkastats. However, all of the brokers on physical hardware show wildly irregular gaps between metric publishing events, roughly 13 to 21 minutes apart:

2019-01-16 18:22:20.856 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547662170130, "id": 1441
2019-01-16 18:43:50.866 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547663459364, "id": 1441
2019-01-16 18:56:41.630 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547664230872, "id": 1441
2019-01-16 19:09:32.229 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547665001633, "id": 1441
2019-01-16 19:22:22.797 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547665772231, "id": 1441
2019-01-16 19:44:07.842 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547667076506, "id": 1441
2019-01-16 19:56:58.609 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547667847848, "id": 1441
2019-01-16 20:09:49.174 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547668618610, "id": 1441
2019-01-16 20:22:39.883 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547669389176, "id": 1441
2019-01-16 20:44:25.059 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547670693733, "id": 1441
2019-01-16 20:57:15.843 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547671465064, "id": 1441
2019-01-16 21:10:06.384 [StatsReporter] INFO  com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547672235845, "id": 1441
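For reference, the gaps between the published timestamps above work out to roughly 770-1300 seconds instead of 30. A quick throwaway check (my own snippet, not part of kafkastats):

// Throwaway check: gaps between the "published to kafka" timestamps above
// (epoch milliseconds), which should be ~30 s apart.
public class PublishGaps {
    public static void main(String[] args) {
        long[] ts = {
            1547662170130L, 1547663459364L, 1547664230872L, 1547665001633L,
            1547665772231L, 1547667076506L, 1547667847848L, 1547668618610L,
            1547669389176L, 1547670693733L, 1547671465064L, 1547672235845L
        };
        for (int i = 1; i < ts.length; i++) {
            // Prints gaps of roughly 770-1300 seconds.
            System.out.printf("gap %2d: %6.0f s%n", i, (ts[i] - ts[i - 1]) / 1000.0);
        }
    }
}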

The brokers are averaging 90% idle, with reasonable amounts of free memory. Any ideas where I should be looking to see what it's trying to do?

I've tried a dozen times to attach a screenshot from glances, but it's not working. So I'll paste this instead:


CPU       7.0%  nice:     0.0%                                     LOAD    32-core                                     MEM     24.3%  active:    73.2G                                     SWAP      0.0%
user:     4.6%  irq:      0.0%                                     1 min:    1.69                                      total:   126G  inactive:  47.2G                                     total:       0
system:   1.7%  iowait:   0.0%                                     5 min:    2.24                                      used:   30.6G  buffers:   2.12M                                     used:        0
idle:    93.0%  steal:    0.0%                                     15 min:   2.38                                      free:   95.3G  cached:    94.8G                                     free:        0

NETWORK     Rx/s   Tx/s   Processes filter: .*kafka.* (press ENTER to edit)
bond0      135Mb 58.7Mb   TASKS 4 (324 thr), 0 run, 4 slp, 0 oth sorted automatically by cpu_percent, flat view
eno1          0b     0b
eno2          0b     0b     CPU%  MEM%  VIRT   RES   PID USER        NI S    TIME+ IOR/s IOW/s Command
ens1f4      82Kb     0b    227.2  20.0 64.5G 25.2G  2072 kafka        0 S 59:18.40     0     0 /usr/lib/jvm/java-8-oracle/bin/java -Xmx24G -Xms24G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:Initi
ens1f4d1   135Mb 58.7Mb      0.3   0.2 5.22G  243M 15576 nobody       0 S  0:08.24     0     0 /usr/bin/java -server -Xmx800M -Xms800M -verbosegc -Xloggc:/var/log/doctorkafka/gc.log -XX:+UseGCLogFileRo
lo         102Kb  102Kb      0.0   0.2 4.82G  293M 18656 dd-agent     0 S 29:29.53     0     0 java -Xms50m -Xmx200m -classpath /opt/datadog-agent/agent/checks/libs/jmxfetch-0.20.1-jar-with-dependencie
                             0.0   0.0 5.90M  688K  9160 bgallew      0 S  0:00.00     0     0 tail -F /var/log/doctorkafka/kafkastats.log
DISK I/O     R/s    W/s
md0         382K  14.9M
sda1           0    55K

I am not sure about the root cause of the slow stats polling. Can you relax the kafkastats polling interval to 60 seconds (-pollingintervalinseconds 60)? Will that solve the problem?

I've tried that (and longer times!). No bueno.

It looks like the actual issue is that /usr/bin/ec2metadata keeps being run over and over, even though the metadata it provides can't change after the system starts up. I'm going to see if I can figure out why it's being re-run and stop that.
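If that's confirmed, one possible mitigation is to cache the command's output after the first call, since it can't change while the process is running. A minimal sketch (Ec2MetadataCache is my own illustration, not existing DoctorK code; I'm assuming kafkastats shells out to /usr/bin/ec2metadata on each polling cycle, and that on non-EC2 hosts each invocation blocks until the metadata lookup times out, which would explain the multi-minute gaps):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: run /usr/bin/ec2metadata once and reuse the output,
// instead of shelling out on every stats-polling cycle.
public class Ec2MetadataCache {
    private static final AtomicReference<String> CACHED = new AtomicReference<>();

    public static String get() throws Exception {
        String cached = CACHED.get();
        if (cached != null) {
            return cached;
        }
        // First (and only) external invocation; on a non-EC2 host this is the
        // call that can hang until the metadata lookup times out.
        Process p = new ProcessBuilder("/usr/bin/ec2metadata").start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        CACHED.compareAndSet(null, out.toString());
        return CACHED.get();
    }
}

A physical host would still pay the timeout once on the first invocation, but subsequent polling cycles would reuse the cached result instead of blocking every time.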