confluentinc / kafka-connect-hdfs

Kafka Connect HDFS connector

HDFS sink having ConcurrentModificationException issue after some random duration

srch07 opened this issue · comments

Hi Team,

We were evaluating Kafka Connect as an option because we are removing our YARN usage from the cluster while retaining HDFS usage for the time being.

The S3 sink seems decent, but the HDFS sink has a lot of issues.

Worst of all, every few hours it fails with the stack trace below.

java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
at java.util.HashMap$KeyIterator.next(HashMap.java:1469)
at io.confluent.connect.hdfs.TopicPartitionWriter.close(TopicPartitionWriter.java:462)
at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:471)
at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:158)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:401)
at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:598)
at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1400(WorkerSinkTask.java:70)
at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsRevoked(WorkerSinkTask.java:678)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsRevoked(ConsumerCoordinator.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare(ConsumerCoordinator.java:687)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:414)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:358)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:497)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1274)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1236)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1216)
at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:448)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:229)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:185)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:235)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
at java.util.HashMap$KeyIterator.next(HashMap.java:1469)
at io.confluent.connect.hdfs.TopicPartitionWriter.close(TopicPartitionWriter.java:462)
at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:471)
at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:158)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:401)
at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:598)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:202)
... 7 more
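
For anyone hitting this, the exception at TopicPartitionWriter.close(TopicPartitionWriter.java:462) is HashMap's fail-fast iterator: it fires when a map is structurally modified while something is iterating over it, which suggests the close path removes per-partition writers from a map it is also iterating. The snippet below is only a minimal sketch of that pattern for illustration (the writers field and its types are hypothetical, not the connector's actual code), contrasting the failing loop with a variant that removes through the iterator.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class CloseSketch {

    // Hypothetical stand-in for the per-partition writer map inside TopicPartitionWriter.
    private final Map<String, AutoCloseable> writers = new HashMap<>();

    // Unsafe: structurally modifying the map while iterating its key set
    // makes the fail-fast iterator throw ConcurrentModificationException.
    public void closeUnsafe() throws Exception {
        for (String partition : writers.keySet()) {
            writers.get(partition).close();
            writers.remove(partition); // modification during iteration
        }
    }

    // Safe: walk the entry set with an explicit iterator and remove through it.
    public void closeSafe() throws Exception {
        Iterator<Map.Entry<String, AutoCloseable>> it = writers.entrySet().iterator();
        while (it.hasNext()) {
            it.next().getValue().close();
            it.remove();
        }
    }
}

The 10.0.1 release referenced at the end of this issue resolves the problem on the connector side; the sketch is only meant to explain why the trace points at a HashMap iterator.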

We are using confluentinc/cp-kafka-connect:5.5.3 and confluentinc/kafka-connect-hdfs:10.0.0.
Our Hadoop version is 2.7.5.

This is on top of the sink being extremely slow. We are running 8 Kafka Connect pods, each with 5 GB of memory, for a total of 32 worker tasks. Since we already have Gobblin in production, we compared against the same data: Gobblin produced a consistent 240 MB per hour across all topics using 110 GB of RAM, while this sink struggles to produce 250 MB for a single topic using 40 GB of RAM. To be honest, either I am messing things up big time here, or the HDFS sink is nowhere close to production-ready.

curl -X PUT -H "Content-Type: application/json" \
  --data '{
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "32",
    "topics": "",
    "flush.size": 99999,
    "hdfs.authentication.kerberos": "true",
    "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
    "hdfs.url": "hdfs://:9000",
    "connect.hdfs.principal": "",
    "connect.hdfs.keytab": ".keytab",
    "hdfs.namenode.principal": "hdfs/****",
    "topics.dir": "/tmp/kc/data",
    "logs.dir": "/tmp/kc/logs",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "path.format": "'''date_month'''=YYYY-MM/'''date_hour'''=YYYY-MM-dd-HH",
    "partition.duration.ms": 900000,
    "rotate.schedule.interval.ms": 900000,
    "offset.flush.interval.ms": 30000,
    "locale": "en-US",
    "timezone": "UTC",
    "timestamp.extractor": "Record",
    "group.id": "test1"
  }' \
  http://$THIS_POD_IP:28083/connectors/quickstart-avro-hdfs-sink-iap/config
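
As a side note on the partitioning settings above: in the path.format value, the date_month= and date_hour= segments need to reach the connector as quoted literals so the Joda-Time formatter used by the TimeBasedPartitioner does not try to interpret them as pattern letters. Below is a minimal sketch of what the intended pattern produces for a given record timestamp, assuming Joda-Time is on the classpath; the class name and the example instant are hypothetical, not part of the connector.

import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

public class PathFormatSketch {
    public static void main(String[] args) {
        // The intended path.format pattern, with date_month / date_hour as quoted literals.
        DateTimeFormatter fmt = DateTimeFormat
            .forPattern("'date_month'=YYYY-MM/'date_hour'=YYYY-MM-dd-HH")
            .withZone(DateTimeZone.UTC);

        long recordTimestamp = 1614603600000L; // 2021-03-01T13:00:00Z, arbitrary example instant
        // Prints: date_month=2021-03/date_hour=2021-03-01-13
        System.out.println(fmt.print(recordTimestamp));
    }
}

With timestamp.extractor set to Record, it is the record's Kafka timestamp that gets formatted into this path.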

Closing the issue, as this was fixed in the latest version, 10.0.1.