HeartSaVioR / spark-sql-kafka-offset-committer

Kafka offset committer for structured streaming query


Confirmation: manual offset commit on Kafka using Spark

kishansdpt opened this issue · comments

Hi,

As part of the manual offset commit, we followed your approach and implemented it as shown below. Can you please confirm?

  1. In the onQueryProgress(QueryProgressEvent queryProgress) callback we obtain the offset and partition details.

  2. From queryProgress -> endOffset we read the offsets and prepare OffsetAndMetadata objects.

  3. From queryProgress -> endOffset we read the partitions and prepare TopicPartition objects.

  4. We use the same group id value for the kafkaConsumer as for the topic we are processing with Spark readStream().

  5. Finally we commit the offsets on the kafkaConsumer like below:

         Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>(); // populated from endOffset
         kafkaConsumer.commitSync(offsets);
    

We are able to successfully commit offsets for multiple partitions with the above approach; a sketch of such a listener follows below.
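
For illustration only, here is a minimal sketch of a StreamingQueryListener along the lines of steps 1-5. The class name OffsetCommitListener, the way the KafkaConsumer is supplied, and the use of Jackson to parse the endOffset JSON are assumptions, not part of the original report; such a listener would be registered with sparkSession.streams().addListener(...) before the query starts.

    import java.util.HashMap;
    import java.util.Map;

    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.spark.sql.streaming.SourceProgress;
    import org.apache.spark.sql.streaming.StreamingQueryListener;

    // Sketch: after each micro-batch, commit the endOffsets reported in the progress event to Kafka.
    public class OffsetCommitListener extends StreamingQueryListener {

        // Consumer configured with the same group.id that the query uses.
        private final KafkaConsumer<String, String> kafkaConsumer;
        private final ObjectMapper mapper = new ObjectMapper();

        public OffsetCommitListener(KafkaConsumer<String, String> kafkaConsumer) {
            this.kafkaConsumer = kafkaConsumer;
        }

        @Override
        public void onQueryProgress(QueryProgressEvent event) {
            for (SourceProgress source : event.progress().sources()) {
                try {
                    // endOffset is a JSON string like {"topic":{"0":34278,"1":33778}}
                    Map<String, Map<String, Long>> endOffsets = mapper.readValue(
                        source.endOffset(), new TypeReference<Map<String, Map<String, Long>>>() {});

                    Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
                    for (Map.Entry<String, Map<String, Long>> topic : endOffsets.entrySet()) {
                        for (Map.Entry<String, Long> partition : topic.getValue().entrySet()) {
                            toCommit.put(
                                new TopicPartition(topic.getKey(), Integer.parseInt(partition.getKey())),
                                new OffsetAndMetadata(partition.getValue()));
                        }
                    }
                    kafkaConsumer.commitSync(toCommit);
                } catch (Exception e) {
                    // Don't fail the query because a commit failed; log and continue.
                    e.printStackTrace();
                }
            }
        }

        @Override
        public void onQueryStarted(QueryStartedEvent event) { }

        @Override
        public void onQueryTerminated(QueryTerminatedEvent event) { }
    }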

The next time Spark reads the Kafka topic with readStream(), we use startingOffsetDetails() to get the starting offsets that were committed through the Kafka consumer:
    Dataset<Row> dataFrame = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "127.0.0.1:9092")
        .option("subscribe", kafkaTopic)
        .option("enable.auto.commit", false)
        .option("group.id", groupIdValue)
        .option("startingOffsets", startingOffsetDetails())
        .option("failOnDataLoss", "false")
        .load()
        .selectExpr("partition", "offset", "deserialize(value) as value");
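
Note that the selectExpr above assumes a deserialize UDF has already been registered on the session. Purely as an illustration (assuming the Kafka value is a UTF-8 encoded string), such a registration could look like:

    import java.nio.charset.StandardCharsets;

    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    // Illustrative only: turn the Kafka value bytes into a UTF-8 string.
    spark.udf().register("deserialize",
        (UDF1<byte[], String>) bytes -> bytes == null ? null : new String(bytes, StandardCharsets.UTF_8),
        DataTypes.StringType);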

    // Code to get committed offsets.
    // The method below returns data like {"epeOffsetTpMutltiPartition":{"0":34278,"1":33778}}
    private static String startingOffsetDetails() {
        Properties properties = getProperties();
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);
        List<PartitionInfo> partitions = kafkaConsumer.partitionsFor(kafkaTopic);
        StringBuilder sb = new StringBuilder();
        sb.append("{\"" + kafkaTopic + "\":{");
        for (PartitionInfo partitionInfo : partitions) {
            TopicPartition tp = new TopicPartition(kafkaTopic, partitionInfo.partition());
            OffsetAndMetadata offsetAndMetadata = kafkaConsumer.committed(tp);
            if (offsetAndMetadata == null) {
                // No committed offset for this partition yet: start from 0.
                sb.append("\"" + partitionInfo.partition() + "\":" + 0 + ",");
            } else {
                sb.append("\"" + partitionInfo.partition() + "\":" + offsetAndMetadata.offset() + ",");
            }
        }
        sb.deleteCharAt(sb.lastIndexOf(","));
        sb.append("}}");
        kafkaConsumer.close();

        String offsetDetails = sb.toString();
        System.out.println(offsetDetails);
        return offsetDetails;
    }

Can you please confirm the same from your end?

Spark intentionally maintains the offsets by itself instead of relying on Kafka's offset commit mechanism. Replacing Spark's checkpointing is not supported.

Please note that this project exists to help put the offsets Spark commits into its own checkpoint for such a query into Kafka as well, so that the offset information can be leveraged via various existing Kafka ecosystem tools. It's not for replacing the checkpoint mechanism in Spark.
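
For example, once the offsets are in Kafka they can be read back with standard tooling. A minimal sketch using Kafka's AdminClient (the group id "my-streaming-group" is illustrative):

    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    // Sketch: read back the offsets committed for a consumer group.
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
    try (AdminClient admin = AdminClient.create(props)) {
        Map<TopicPartition, OffsetAndMetadata> committed =
            admin.listConsumerGroupOffsets("my-streaming-group")
                 .partitionsToOffsetAndMetadata()
                 .get();
        committed.forEach((tp, om) -> System.out.println(tp + " -> " + om.offset()));
    } catch (Exception e) {
        e.printStackTrace();
    }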

Hope this helps.