pinterest / DoctorK

DoctorK is a service for Kafka cluster auto healing and workload balancing

Breaking kafkastats change between 0.2.3 and 0.2.4.2

BrianGallew opened this issue

Several of my clusters have dead or very-low-volume topics. With the 0.2.3 kafkastats client, DoctorKafka displayed the cluster data correctly. With 0.2.4.2, however, hasFailure is set to True whenever the JMX collector cannot collect a per-topic metric, e.g. BytesOutPerSec. Both versions log the same message:

2019-01-10 23:36:09.108 [StatsReporter] WARN  com.pinterest.doctorkafka.stats.BrokerStatsRetriever - Got exception for doctorkafka.operator_report
javax.management.InstanceNotFoundException: kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=doctorkafka.operator_report
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1095) ~[?:1.8.0_181]
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:643) ~[?:1.8.0_181]
	at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678) ~[?:1.8.0_181]
	at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1445) ~[?:1.8.0_181]
	at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76) ~[?:1.8.0_181]
	at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309) ~[?:1.8.0_181]
	at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401) ~[?:1.8.0_181]
	at javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:639) ~[?:1.8.0_181]
	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) ~[?:?]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_181]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_181]
	at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357) ~[?:1.8.0_181]
	at sun.rmi.transport.Transport$1.run(Transport.java:200) ~[?:1.8.0_181]
	at sun.rmi.transport.Transport$1.run(Transport.java:197) ~[?:1.8.0_181]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_181]
	at sun.rmi.transport.Transport.serviceCall(Transport.java:196) ~[?:1.8.0_181]
	at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573) ~[?:1.8.0_181]
	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834) ~[?:1.8.0_181]
	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688) ~[?:1.8.0_181]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_181]
	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687) ~[?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_181]
	at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:283) ~[?:1.8.0_181]
	at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:260) ~[?:1.8.0_181]
	at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161) ~[?:1.8.0_181]
	at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source) ~[?:?]
	at javax.management.remote.rmi.RMIConnectionImpl_Stub.getAttribute(Unknown Source) ~[?:1.8.0_181]
	at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.getAttribute(RMIConnector.java:903) ~[?:1.8.0_181]
	at com.pinterest.doctorkafka.stats.KafkaMetricRetrievingTask.call(KafkaMetricRetrievingTask.java:30) ~[kafkastats-0.2.4.2-jar-with-dependencies.jar:?]
	at com.pinterest.doctorkafka.stats.KafkaMetricRetrievingTask.call(KafkaMetricRetrievingTask.java:11) ~[kafkastats-0.2.4.2-jar-with-dependencies.jar:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

However, the old client sends good data:

{'amiId': 'ami-xxxxxxxx',
 'availabilityZone': 'us-east-1c',
 'cpuUsage': 1.4,
 'failureReason': None,
 'followerReplicas': [{'partition': 13, 'topic': '__consumer_offsets'},
  {'partition': 23, 'topic': '__consumer_offsets'},
  {'partition': 19, 'topic': '__consumer_offsets'},
  {'partition': 17, 'topic': '__consumer_offsets'},
  {'partition': 32, 'topic': '__consumer_offsets'},
  {'partition': 26, 'topic': '__consumer_offsets'},
  {'partition': 7, 'topic': '__consumer_offsets'},
  {'partition': 40, 'topic': '__consumer_offsets'},
  {'partition': 5, 'topic': '__consumer_offsets'},
  {'partition': 3, 'topic': '__consumer_offsets'},
  {'partition': 34, 'topic': '__consumer_offsets'},
  {'partition': 47, 'topic': '__consumer_offsets'},
  {'partition': 16, 'topic': '__consumer_offsets'},
  {'partition': 14, 'topic': '__consumer_offsets'},
  {'partition': 41, 'topic': '__consumer_offsets'},
  {'partition': 10, 'topic': '__consumer_offsets'},
  {'partition': 49, 'topic': '__consumer_offsets'},
  {'partition': 31, 'topic': '__consumer_offsets'},
  {'partition': 29, 'topic': '__consumer_offsets'},
  {'partition': 0, 'topic': 'doctorkafka.operator_report'},
  {'partition': 25, 'topic': '__consumer_offsets'},
  {'partition': 8, 'topic': '__consumer_offsets'},
  {'partition': 35, 'topic': '__consumer_offsets'},
  {'partition': 4, 'topic': '__consumer_offsets'},
  {'partition': 2, 'topic': '__consumer_offsets'}],
 'freeDiskSpaceInBytes': 4291677859840,
 'hasFailure': False,
 'id': 10286,
 'inReassignmentReplicas': [],
 'instanceType': 'm5.large',
 'kafkaVersion': '1.1.1',
 'leaderReplicaStats': [{'bytesIn15MinMeanRate': 78,
   'bytesIn1MinMeanRate': 79,
   'bytesIn5MinMeanRate': 78,
   'bytesOut15MinMeanRate': 888,
   'bytesOut1MinMeanRate': 1517,
   'bytesOut5MinMeanRate': 1863,
   'cpuUsage': 1.4,
   'endOffset': 3778553,
   'inReassignment': False,
   'isLeader': True,
   'logSizeInBytes': 280141429,
   'numLogSegments': 1,
   'partition': 1,
   'startOffset': 3701289,
   'timestamp': 1547162979624,
   'topic': 'doctorkafka.brokerstats',
   'underReplicated': False}],
 'leaderReplicas': [{'partition': 1, 'topic': 'doctorkafka.brokerstats'}],
 'leadersBytesIn15MinRate': 78,
 'leadersBytesIn1MinRate': 79,
 'leadersBytesIn5MinRate': 78,
 'leadersBytesOut15MinRate': 888,
 'leadersBytesOut1MinRate': 1517,
 'leadersBytesOut5MinRate': 1863,
 'logFilesPath': '/mnt/kafka/data',
 'name': 'ip-10-10-2-86',
 'numLeaders': 1,
 'numReplicas': 26,
 'rackId': None,
 'statsVersion': '0.1.15',
 'sysBytesIn1MinRate': 0,
 'sysBytesOut1MinRate': 0,
 'timestamp': 1547162978968,
 'topicsBytesIn15MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 78},
 'topicsBytesIn1MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 79},
 'topicsBytesIn5MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 78},
 'topicsBytesOut15MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 888},
 'topicsBytesOut1MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 1517},
 'topicsBytesOut5MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 1863},
 'totalDiskSpaceInBytes': 4292333535232,
 'zkUrl': '10.10.16.238:2181,10.10.2.10:2181,10.10.6.32:2181'}

while the new kafkastats client sends bad data (note 'hasFailure': True):

{'amiId': 'ami-xxxxxxx',
 'availabilityZone': 'us-east-1c',
 'cpuUsage': 4.0,
 'failureReason': None,
 'followerReplicas': [{'partition': 13, 'topic': '__consumer_offsets'},
  {'partition': 23, 'topic': '__consumer_offsets'},
  {'partition': 19, 'topic': '__consumer_offsets'},
  {'partition': 17, 'topic': '__consumer_offsets'},
  {'partition': 32, 'topic': '__consumer_offsets'},
  {'partition': 26, 'topic': '__consumer_offsets'},
  {'partition': 7, 'topic': '__consumer_offsets'},
  {'partition': 40, 'topic': '__consumer_offsets'},
  {'partition': 5, 'topic': '__consumer_offsets'},
  {'partition': 3, 'topic': '__consumer_offsets'},
  {'partition': 34, 'topic': '__consumer_offsets'},
  {'partition': 47, 'topic': '__consumer_offsets'},
  {'partition': 16, 'topic': '__consumer_offsets'},
  {'partition': 14, 'topic': '__consumer_offsets'},
  {'partition': 41, 'topic': '__consumer_offsets'},
  {'partition': 10, 'topic': '__consumer_offsets'},
  {'partition': 49, 'topic': '__consumer_offsets'},
  {'partition': 31, 'topic': '__consumer_offsets'},
  {'partition': 29, 'topic': '__consumer_offsets'},
  {'partition': 0, 'topic': 'doctorkafka.operator_report'},
  {'partition': 25, 'topic': '__consumer_offsets'},
  {'partition': 8, 'topic': '__consumer_offsets'},
  {'partition': 35, 'topic': '__consumer_offsets'},
  {'partition': 4, 'topic': '__consumer_offsets'},
  {'partition': 2, 'topic': '__consumer_offsets'}],
 'freeDiskSpaceInBytes': 4291677859840,
 'hasFailure': True,
 'id': 10286,
 'inReassignmentReplicas': [],
 'instanceType': 'm5.large',
 'kafkaVersion': '1.1.1',
 'leaderReplicaStats': [{'bytesIn15MinMeanRate': 78,
   'bytesIn1MinMeanRate': 81,
   'bytesIn5MinMeanRate': 79,
   'bytesOut15MinMeanRate': 1013,
   'bytesOut1MinMeanRate': 13533,
   'bytesOut5MinMeanRate': 2859,
   'cpuUsage': 4.0,
   'endOffset': 3778503,
   'inReassignment': False,
   'isLeader': True,
   'logSizeInBytes': 280130728,
   'numLogSegments': 1,
   'partition': 1,
   'startOffset': 3701289,
   'timestamp': 1547162844124,
   'topic': 'doctorkafka.brokerstats',
   'underReplicated': False}],
 'leaderReplicas': [{'partition': 1, 'topic': 'doctorkafka.brokerstats'}],
 'leadersBytesIn15MinRate': 78,
 'leadersBytesIn1MinRate': 81,
 'leadersBytesIn5MinRate': 79,
 'leadersBytesOut15MinRate': 1013,
 'leadersBytesOut1MinRate': 13533,
 'leadersBytesOut5MinRate': 2859,
 'logFilesPath': '/mnt/kafka/data',
 'name': 'ip-10-10-2-86',
 'numLeaders': 1,
 'numReplicas': 26,
 'rackId': None,
 'statsVersion': '0.1.15',
 'sysBytesIn1MinRate': 0,
 'sysBytesOut1MinRate': 0,
 'timestamp': 1547162843992,
 'topicsBytesIn15MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 78},
 'topicsBytesIn1MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 81},
 'topicsBytesIn5MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 79},
 'topicsBytesOut15MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 1013},
 'topicsBytesOut1MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 13533},
 'topicsBytesOut5MinRate': {'__consumer_offsets': 0,
  'doctorkafka.brokerstats': 2859},
 'totalDiskSpaceInBytes': 4292333535232,
 'zkUrl': '10.10.16.238:2181,10.10.2.10:2181,10.10.6.32:2181'}

@BrianGallew Thanks for reporting the issue! We have put up a fix for this in #76. Can you try again and see whether it resolves the problem on your side?

I'm building it right now.

Awesome, yes, that fixed it!