HeartSaVioR / spark-sql-kafka-offset-committer

Kafka offset committer for structured streaming query


Confirmation: manual offset commit on Kafka using Spark

kishansdpt opened this issue · comments

Hi,

As part of the manual offset commit, we followed your approach and implemented it as shown below. Can you please confirm?

  1. In the onQueryProgress(QueryProgressEvent queryProgress) callback we obtain the offset and partition details.

  2. From queryProgress -> endOffset we read the offsets and prepare OffsetAndMetadata objects.

  3. From queryProgress -> endOffset we read the partitions and prepare TopicPartition objects.

  4. We use the same group id value for the kafkaConsumer as for the topic we are processing with Spark readStream().

  5. Finally we commit the offsets on the kafkaConsumer like below:

         Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>(); // populated from endOffset
         kafkaConsumer.commitSync(offsets);
    

We are able to successfully commit offsets for multiple partitions with the above approach; a sketch of such a listener follows below.
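
For illustration only, here is a minimal sketch of a StreamingQueryListener along the lines of steps 1-5. The class name OffsetCommitListener, the way the KafkaConsumer is supplied, and the use of Jackson to parse the endOffset JSON are assumptions, not part of the original report; such a listener would be registered with sparkSession.streams().addListener(...) before the query starts.

    import java.util.HashMap;
    import java.util.Map;

    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.spark.sql.streaming.SourceProgress;
    import org.apache.spark.sql.streaming.StreamingQueryListener;

    // Sketch: after each micro-batch, commit the endOffsets reported in the progress event to Kafka.
    public class OffsetCommitListener extends StreamingQueryListener {

        // Consumer configured with the same group.id that the query uses.
        private final KafkaConsumer<String, String> kafkaConsumer;
        private final ObjectMapper mapper = new ObjectMapper();

        public OffsetCommitListener(KafkaConsumer<String, String> kafkaConsumer) {
            this.kafkaConsumer = kafkaConsumer;
        }

        @Override
        public void onQueryProgress(QueryProgressEvent event) {
            for (SourceProgress source : event.progress().sources()) {
                try {
                    // endOffset is a JSON string like {"topic":{"0":34278,"1":33778}}
                    Map<String, Map<String, Long>> endOffsets = mapper.readValue(
                        source.endOffset(), new TypeReference<Map<String, Map<String, Long>>>() {});

                    Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
                    for (Map.Entry<String, Map<String, Long>> topic : endOffsets.entrySet()) {
                        for (Map.Entry<String, Long> partition : topic.getValue().entrySet()) {
                            toCommit.put(
                                new TopicPartition(topic.getKey(), Integer.parseInt(partition.getKey())),
                                new OffsetAndMetadata(partition.getValue()));
                        }
                    }
                    kafkaConsumer.commitSync(toCommit);
                } catch (Exception e) {
                    // Don't fail the query because a commit failed; log and continue.
                    e.printStackTrace();
                }
            }
        }

        @Override
        public void onQueryStarted(QueryStartedEvent event) { }

        @Override
        public void onQueryTerminated(QueryTerminatedEvent event) { }
    }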

The next time Spark reads the Kafka topic with readStream(), we use startingOffsetDetails() to get the starting offsets that were committed through the Kafka consumer:
    Dataset<Row> dataFrame = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "127.0.0.1:9092")
        .option("subscribe", kafkaTopic)
        .option("enable.auto.commit", false)
        .option("group.id", groupIdValue)
        .option("startingOffsets", startingOffsetDetails())
        .option("failOnDataLoss", "false")
        .load()
        .selectExpr("partition", "offset", "deserialize(value) as value");
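
Note that the selectExpr above assumes a deserialize UDF has already been registered on the session. Purely as an illustration (assuming the Kafka value is a UTF-8 encoded string), such a registration could look like:

    import java.nio.charset.StandardCharsets;

    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    // Illustrative only: turn the Kafka value bytes into a UTF-8 string.
    spark.udf().register("deserialize",
        (UDF1<byte[], String>) bytes -> bytes == null ? null : new String(bytes, StandardCharsets.UTF_8),
        DataTypes.StringType);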

    // Code to get committed offsets.
    // The method below returns data like {"epeOffsetTpMutltiPartition":{"0":34278,"1":33778}}
    private static String startingOffsetDetails() {
        Properties properties = getProperties();
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);
        List<PartitionInfo> partitions = kafkaConsumer.partitionsFor(kafkaTopic);
        StringBuilder sb = new StringBuilder();
        sb.append("{\"" + kafkaTopic + "\":{");
        for (PartitionInfo partitionInfo : partitions) {
            TopicPartition tp = new TopicPartition(kafkaTopic, partitionInfo.partition());
            OffsetAndMetadata offsetAndMetadata = kafkaConsumer.committed(tp);
            if (offsetAndMetadata == null) {
                // No committed offset for this partition yet: start from 0.
                sb.append("\"" + partitionInfo.partition() + "\":" + 0 + ",");
            } else {
                sb.append("\"" + partitionInfo.partition() + "\":" + offsetAndMetadata.offset() + ",");
            }
        }
        sb.deleteCharAt(sb.lastIndexOf(","));
        sb.append("}}");
        kafkaConsumer.close();

        String offsetDetails = sb.toString();
        System.out.println(offsetDetails);
        return offsetDetails;
    }

Can you please confirm the same from your end?

Spark intentionally maintains the offsets by itself instead of relying on Kafka's offset commit mechanism. Replacing Spark's checkpointing is not supported.

Please note that this project exists to help put the offsets Spark commits into its own checkpoint for such a query into Kafka as well, so that the offset information can be leveraged via various existing Kafka ecosystem tools. It's not for replacing the checkpoint mechanism in Spark.
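
For example, once the offsets are in Kafka they can be read back with standard tooling. A minimal sketch using Kafka's AdminClient (the group id "my-streaming-group" is illustrative):

    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    // Sketch: read back the offsets committed for a consumer group.
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
    try (AdminClient admin = AdminClient.create(props)) {
        Map<TopicPartition, OffsetAndMetadata> committed =
            admin.listConsumerGroupOffsets("my-streaming-group")
                 .partitionsToOffsetAndMetadata()
                 .get();
        committed.forEach((tp, om) -> System.out.println(tp + " -> " + om.offset()));
    } catch (Exception e) {
        e.printStackTrace();
    }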

Hope this helps.