Breaking kafkastats change between 0.2.3 and 0.2.4.2
BrianGallew opened this issue
For a number of clusters, I have dead or super-low-volume topics. With the 0.2.3 kafkastats client, DoctorKafka displayed the cluster data correctly. With 0.2.4.2, however, hasFailure is now set to True whenever the JMX collector cannot collect a metric such as BytesOutPerSec. Both versions log the same message:
2019-01-10 23:36:09.108 [StatsReporter] WARN com.pinterest.doctorkafka.stats.BrokerStatsRetriever - Got exception for doctorkafka.operator_report
javax.management.InstanceNotFoundException: kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=doctorkafka.operator_report
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1095) ~[?:1.8.0_181]
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:643) ~[?:1.8.0_181]
at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678) ~[?:1.8.0_181]
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1445) ~[?:1.8.0_181]
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76) ~[?:1.8.0_181]
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309) ~[?:1.8.0_181]
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401) ~[?:1.8.0_181]
at javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:639) ~[?:1.8.0_181]
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_181]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_181]
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357) ~[?:1.8.0_181]
at sun.rmi.transport.Transport$1.run(Transport.java:200) ~[?:1.8.0_181]
at sun.rmi.transport.Transport$1.run(Transport.java:197) ~[?:1.8.0_181]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_181]
at sun.rmi.transport.Transport.serviceCall(Transport.java:196) ~[?:1.8.0_181]
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573) ~[?:1.8.0_181]
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834) ~[?:1.8.0_181]
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688) ~[?:1.8.0_181]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_181]
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_181]
at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:283) ~[?:1.8.0_181]
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:260) ~[?:1.8.0_181]
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161) ~[?:1.8.0_181]
at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source) ~[?:?]
at javax.management.remote.rmi.RMIConnectionImpl_Stub.getAttribute(Unknown Source) ~[?:1.8.0_181]
at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.getAttribute(RMIConnector.java:903) ~[?:1.8.0_181]
at com.pinterest.doctorkafka.stats.KafkaMetricRetrievingTask.call(KafkaMetricRetrievingTask.java:30) ~[kafkastats-0.2.4.2-jar-with-dependencies.jar:?]
at com.pinterest.doctorkafka.stats.KafkaMetricRetrievingTask.call(KafkaMetricRetrievingTask.java:11) ~[kafkastats-0.2.4.2-jar-with-dependencies.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Despite that warning, the old client sends good data:
{'amiId': 'ami-xxxxxxxx',
'availabilityZone': 'us-east-1c',
'cpuUsage': 1.4,
'failureReason': None,
'followerReplicas': [{'partition': 13, 'topic': '__consumer_offsets'},
{'partition': 23, 'topic': '__consumer_offsets'},
{'partition': 19, 'topic': '__consumer_offsets'},
{'partition': 17, 'topic': '__consumer_offsets'},
{'partition': 32, 'topic': '__consumer_offsets'},
{'partition': 26, 'topic': '__consumer_offsets'},
{'partition': 7, 'topic': '__consumer_offsets'},
{'partition': 40, 'topic': '__consumer_offsets'},
{'partition': 5, 'topic': '__consumer_offsets'},
{'partition': 3, 'topic': '__consumer_offsets'},
{'partition': 34, 'topic': '__consumer_offsets'},
{'partition': 47, 'topic': '__consumer_offsets'},
{'partition': 16, 'topic': '__consumer_offsets'},
{'partition': 14, 'topic': '__consumer_offsets'},
{'partition': 41, 'topic': '__consumer_offsets'},
{'partition': 10, 'topic': '__consumer_offsets'},
{'partition': 49, 'topic': '__consumer_offsets'},
{'partition': 31, 'topic': '__consumer_offsets'},
{'partition': 29, 'topic': '__consumer_offsets'},
{'partition': 0, 'topic': 'doctorkafka.operator_report'},
{'partition': 25, 'topic': '__consumer_offsets'},
{'partition': 8, 'topic': '__consumer_offsets'},
{'partition': 35, 'topic': '__consumer_offsets'},
{'partition': 4, 'topic': '__consumer_offsets'},
{'partition': 2, 'topic': '__consumer_offsets'}],
'freeDiskSpaceInBytes': 4291677859840,
'hasFailure': False,
'id': 10286,
'inReassignmentReplicas': [],
'instanceType': 'm5.large',
'kafkaVersion': '1.1.1',
'leaderReplicaStats': [{'bytesIn15MinMeanRate': 78,
'bytesIn1MinMeanRate': 79,
'bytesIn5MinMeanRate': 78,
'bytesOut15MinMeanRate': 888,
'bytesOut1MinMeanRate': 1517,
'bytesOut5MinMeanRate': 1863,
'cpuUsage': 1.4,
'endOffset': 3778553,
'inReassignment': False,
'isLeader': True,
'logSizeInBytes': 280141429,
'numLogSegments': 1,
'partition': 1,
'startOffset': 3701289,
'timestamp': 1547162979624,
'topic': 'doctorkafka.brokerstats',
'underReplicated': False}],
'leaderReplicas': [{'partition': 1, 'topic': 'doctorkafka.brokerstats'}],
'leadersBytesIn15MinRate': 78,
'leadersBytesIn1MinRate': 79,
'leadersBytesIn5MinRate': 78,
'leadersBytesOut15MinRate': 888,
'leadersBytesOut1MinRate': 1517,
'leadersBytesOut5MinRate': 1863,
'logFilesPath': '/mnt/kafka/data',
'name': 'ip-10-10-2-86',
'numLeaders': 1,
'numReplicas': 26,
'rackId': None,
'statsVersion': '0.1.15',
'sysBytesIn1MinRate': 0,
'sysBytesOut1MinRate': 0,
'timestamp': 1547162978968,
'topicsBytesIn15MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 78},
'topicsBytesIn1MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 79},
'topicsBytesIn5MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 78},
'topicsBytesOut15MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 888},
'topicsBytesOut1MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 1517},
'topicsBytesOut5MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 1863},
'totalDiskSpaceInBytes': 4292333535232,
'zkUrl': '10.10.16.238:2181,10.10.2.10:2181,10.10.6.32:2181'}
while the new kafkastats client sends data for the same broker with hasFailure set to True:
{'amiId': 'ami-xxxxxxx',
'availabilityZone': 'us-east-1c',
'cpuUsage': 4.0,
'failureReason': None,
'followerReplicas': [{'partition': 13, 'topic': '__consumer_offsets'},
{'partition': 23, 'topic': '__consumer_offsets'},
{'partition': 19, 'topic': '__consumer_offsets'},
{'partition': 17, 'topic': '__consumer_offsets'},
{'partition': 32, 'topic': '__consumer_offsets'},
{'partition': 26, 'topic': '__consumer_offsets'},
{'partition': 7, 'topic': '__consumer_offsets'},
{'partition': 40, 'topic': '__consumer_offsets'},
{'partition': 5, 'topic': '__consumer_offsets'},
{'partition': 3, 'topic': '__consumer_offsets'},
{'partition': 34, 'topic': '__consumer_offsets'},
{'partition': 47, 'topic': '__consumer_offsets'},
{'partition': 16, 'topic': '__consumer_offsets'},
{'partition': 14, 'topic': '__consumer_offsets'},
{'partition': 41, 'topic': '__consumer_offsets'},
{'partition': 10, 'topic': '__consumer_offsets'},
{'partition': 49, 'topic': '__consumer_offsets'},
{'partition': 31, 'topic': '__consumer_offsets'},
{'partition': 29, 'topic': '__consumer_offsets'},
{'partition': 0, 'topic': 'doctorkafka.operator_report'},
{'partition': 25, 'topic': '__consumer_offsets'},
{'partition': 8, 'topic': '__consumer_offsets'},
{'partition': 35, 'topic': '__consumer_offsets'},
{'partition': 4, 'topic': '__consumer_offsets'},
{'partition': 2, 'topic': '__consumer_offsets'}],
'freeDiskSpaceInBytes': 4291677859840,
'hasFailure': True,
'id': 10286,
'inReassignmentReplicas': [],
'instanceType': 'm5.large',
'kafkaVersion': '1.1.1',
'leaderReplicaStats': [{'bytesIn15MinMeanRate': 78,
'bytesIn1MinMeanRate': 81,
'bytesIn5MinMeanRate': 79,
'bytesOut15MinMeanRate': 1013,
'bytesOut1MinMeanRate': 13533,
'bytesOut5MinMeanRate': 2859,
'cpuUsage': 4.0,
'endOffset': 3778503,
'inReassignment': False,
'isLeader': True,
'logSizeInBytes': 280130728,
'numLogSegments': 1,
'partition': 1,
'startOffset': 3701289,
'timestamp': 1547162844124,
'topic': 'doctorkafka.brokerstats',
'underReplicated': False}],
'leaderReplicas': [{'partition': 1, 'topic': 'doctorkafka.brokerstats'}],
'leadersBytesIn15MinRate': 78,
'leadersBytesIn1MinRate': 81,
'leadersBytesIn5MinRate': 79,
'leadersBytesOut15MinRate': 1013,
'leadersBytesOut1MinRate': 13533,
'leadersBytesOut5MinRate': 2859,
'logFilesPath': '/mnt/kafka/data',
'name': 'ip-10-10-2-86',
'numLeaders': 1,
'numReplicas': 26,
'rackId': None,
'statsVersion': '0.1.15',
'sysBytesIn1MinRate': 0,
'sysBytesOut1MinRate': 0,
'timestamp': 1547162843992,
'topicsBytesIn15MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 78},
'topicsBytesIn1MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 81},
'topicsBytesIn5MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 79},
'topicsBytesOut15MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 1013},
'topicsBytesOut1MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 13533},
'topicsBytesOut5MinRate': {'__consumer_offsets': 0,
'doctorkafka.brokerstats': 2859},
'totalDiskSpaceInBytes': 4292333535232,
'zkUrl': '10.10.16.238:2181,10.10.2.10:2181,10.10.6.32:2181'}
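For anyone hitting the same symptom: the log shows that the per-topic BytesOutPerSec MBean simply does not exist for the idle doctorkafka.operator_report topic, so getAttribute throws InstanceNotFoundException. One way a collector can avoid flagging that as a failure is to treat a missing MBean as "no data" and reserve hasFailure for other errors. The Java sketch below illustrates that idea under stated assumptions. It is not the code from #76; the class OptionalMetricReader, its readAttribute helper, and the OneMinuteRate attribute name are hypothetical. It uses the local platform MBean server in place of a broker's remote JMX connection so it runs standalone.

```java
import java.lang.management.ManagementFactory;
import java.util.Optional;
import javax.management.InstanceNotFoundException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper for illustration; not the actual DoctorKafka code or the #76 fix.
public class OptionalMetricReader {

  /**
   * Reads one attribute of an MBean. Returns Optional.empty() when the MBean is
   * not registered (InstanceNotFoundException), e.g. a per-topic metric for an
   * idle topic. All other JMX or I/O errors propagate to the caller.
   */
  public static Optional<Object> readAttribute(MBeanServerConnection conn,
      String objectName, String attribute) throws Exception {
    try {
      return Optional.ofNullable(conn.getAttribute(new ObjectName(objectName), attribute));
    } catch (InstanceNotFoundException e) {
      // A missing MBean means "no data for this metric", not a collector failure.
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    // The local platform MBean server stands in for a broker's remote JMX connection.
    MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
    String bytesOutBean = "kafka.server:type=BrokerTopicMetrics,"
        + "name=BytesOutPerSec,topic=doctorkafka.operator_report";

    boolean hasFailure = false;
    Optional<Object> uptime = Optional.empty();
    Optional<Object> bytesOut = Optional.empty();
    try {
      uptime = readAttribute(conn, "java.lang:type=Runtime", "Uptime");
      // "OneMinuteRate" is an assumed attribute name; this bean is absent locally anyway.
      bytesOut = readAttribute(conn, bytesOutBean, "OneMinuteRate");
    } catch (Exception e) {
      // Only errors other than a missing MBean reach this point.
      hasFailure = true;
    }

    System.out.println("uptime present: " + uptime.isPresent());
    System.out.println("bytesOut present: " + bytesOut.isPresent());
    System.out.println("hasFailure: " + hasFailure);
  }
}
```

Run on its own, this should report the JVM uptime as present, the Kafka topic bean as absent, and hasFailure as false, which is the behavior the 0.2.3 output above reflects.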
@BrianGallew Thanks for reporting the issue! We have put in a fix (#76) for this. Can you try again to see if it resolves the problem on your side?
I'm building it right now.
Awesome, yes, that fixed it!