Slow stats collection outside of AWS
BrianGallew opened this issue
All my Kafka brokers in AWS have no problem meeting the 30-second polling interval for kafkastats. However, all of the brokers on physical hardware show wildly irregular intervals between metric publishing events:
2019-01-16 18:22:20.856 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547662170130, "id": 1441
2019-01-16 18:43:50.866 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547663459364, "id": 1441
2019-01-16 18:56:41.630 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547664230872, "id": 1441
2019-01-16 19:09:32.229 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547665001633, "id": 1441
2019-01-16 19:22:22.797 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547665772231, "id": 1441
2019-01-16 19:44:07.842 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547667076506, "id": 1441
2019-01-16 19:56:58.609 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547667847848, "id": 1441
2019-01-16 20:09:49.174 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547668618610, "id": 1441
2019-01-16 20:22:39.883 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547669389176, "id": 1441
2019-01-16 20:44:25.059 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547670693733, "id": 1441
2019-01-16 20:57:15.843 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547671465064, "id": 1441
2019-01-16 21:10:06.384 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - published to kafka : {"timestamp": 1547672235845, "id": 1441
The brokers are averaging 90% idle, with reasonable amounts of free memory. Any ideas where I should be looking to see what it's trying to do?
I've tried a dozen times to attach a screenshot from glances but ... it's not working. So I'll put in this:
CPU 7.0% (user 4.6%, system 1.7%, idle 93.0%, nice 0.0%, irq 0.0%, iowait 0.0%, steal 0.0%)
LOAD (32-core): 1 min 1.69, 5 min 2.24, 15 min 2.38
MEM 24.3% (total 126G, used 30.6G, free 95.3G, active 73.2G, inactive 47.2G, buffers 2.12M, cached 94.8G)
SWAP 0.0% (total 0, used 0, free 0)

NETWORK      Rx/s    Tx/s
bond0        135Mb   58.7Mb
eno1         0b      0b
eno2         0b      0b
ens1f4       82Kb    0b
ens1f4d1     135Mb   58.7Mb
lo           102Kb   102Kb

DISK I/O     R/s     W/s
md0          382K    14.9M
sda1         0       55K

TASKS (filter: .*kafka.*): 4 (324 thr), 0 run, 4 slp, 0 oth
CPU%   MEM%  VIRT   RES    PID    USER      NI  S  TIME+     IOR/s  IOW/s  Command
227.2  20.0  64.5G  25.2G  2072   kafka     0   S  59:18.40  0      0      /usr/lib/jvm/java-8-oracle/bin/java -Xmx24G -Xms24G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:Initi
0.3    0.2   5.22G  243M   15576  nobody    0   S  0:08.24   0      0      /usr/bin/java -server -Xmx800M -Xms800M -verbosegc -Xloggc:/var/log/doctorkafka/gc.log -XX:+UseGCLogFileRo
0.0    0.2   4.82G  293M   18656  dd-agent  0   S  29:29.53  0      0      java -Xms50m -Xmx200m -classpath /opt/datadog-agent/agent/checks/libs/jmxfetch-0.20.1-jar-with-dependencie
0.0    0.0   5.90M  688K   9160   bgallew   0   S  0:00.00   0      0      tail -F /var/log/doctorkafka/kafkastats.log
I am not sure about the root cause of the slowness in polling for stats. Can you relax the kafkastats polling interval to 60 seconds (-pollingintervalinseconds 60)? Will that solve the problem?
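For reference, that flag goes on the kafkastats launch command. A rough sketch, assuming kafkastats is started directly with a java command along the lines of the README example; the main class and every argument other than -pollingintervalinseconds are placeholders here and should be checked against your actual service definition:

    # Sketch only: relax the kafkastats polling interval at launch time.
    # Keep the classpath, main class, and broker/zookeeper/topic arguments
    # exactly as your service already defines them; only the last flag changes.
    java -server -Xmx800M -Xms800M \
      -cp lib/*:kafkastats-*.jar \
      com.pinterest.doctorkafka.stats.KafkaStatsMain \
      -broker 127.0.0.1 \
      -jmxport 9999 \
      -zookeeper zookeeper001:2181/cluster1 \
      -topic brokerstats \
      -pollingintervalinseconds 60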
I've tried that (and longer times!). No bueno.
It looks like the actual issue is that /usr/bin/ec2metadata keeps getting run over and over, even though the metadata it provides can't change after the system starts up. I'm going to try to figure out why it's being re-run and stop that.
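In the meantime, a possible stopgap is to make the ec2metadata call cheap to repeat. This is only a sketch of a caching wrapper, not anything DoctorKafka ships, and the paths are assumptions: move the real binary aside and install a script in its place that runs it at most once per argument list, then replays the saved output.

    #!/bin/sh
    # Hypothetical caching wrapper for /usr/bin/ec2metadata (assumes the real
    # binary has been moved to /usr/bin/ec2metadata.real). Instance metadata
    # cannot change after boot, so run the real command once per argument list,
    # save its output and exit status, and replay both on later invocations.
    CACHE_DIR=/var/cache/ec2metadata
    mkdir -p "$CACHE_DIR"
    KEY=$(printf '%s' "$*" | md5sum | awk '{print $1}')
    OUT="$CACHE_DIR/$KEY.out"
    RC="$CACHE_DIR/$KEY.rc"
    if [ ! -f "$RC" ]; then
        /usr/bin/ec2metadata.real "$@" > "$OUT" 2>&1
        echo $? > "$RC"
    fi
    cat "$OUT"
    exit "$(cat "$RC")"

Caching the exit status as well means any slow off-AWS timeout is paid once rather than on every stats poll.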