Hurence / logisland

Scalable stream-processing platform for advanced real-time analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink is on the roadmap). The platform performs complex event processing and is well suited to time-series analysis. A large set of ready-to-use processors, data sources, and sinks is available.

Home Page: https://logisland.github.io

add startingOffsets, endingOffsets, quitWhenDone parameters to StructuredStream

oalam opened this issue · comments

The Spark SQL Kafka source, as documented here, can take startingOffsets and endingOffsets parameters.

This could be useful to start a macro-batch stream from data stored in Kafka and end it when done!

Use case: time-series analytics (chunking)

https://dataengi.com/2019/06/06/spark-structured-streaming/

// Subscribe to multiple topics, reading a bounded slice via explicit Kafka
// offsets. In the JSON maps, -2 means "earliest" and -1 means "latest".
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-offset-range").getOrCreate()

val df = spark
  .read // batch read: endingOffsets is honored, and the job ends when done
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")
  .load()
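
Once the offset range is bounded, the chunking part of the use case is ordinary batch work. A minimal sketch, relying on the timestamp column the Kafka source exposes (the 10-minute window size is an arbitrary choice for illustration):

import org.apache.spark.sql.functions.window

// One row per 10-minute chunk, with the number of records in each chunk.
val chunks = df.groupBy(window(df("timestamp"), "10 minutes")).count()
chunks.show(truncate = false)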

I think this is now supported (more or less)
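
On the quitWhenDone part: endingOffsets is only honored by batch (spark.read) queries, so a structured stream cannot be bounded that way. In plain Spark Structured Streaming, a similar "process everything available, then stop" behavior can be approximated with Trigger.Once(). A sketch under that assumption, not logisland's own API; topics, paths, and offsets are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("quit-when-done-sketch").getOrCreate()

val stream = spark
  .readStream // streaming read: startingOffsets is accepted, endingOffsets is not
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .load()

val query = stream
  .writeStream
  .format("parquet")
  .option("path", "/tmp/chunks")            // placeholder output path
  .option("checkpointLocation", "/tmp/chk") // placeholder checkpoint path
  .trigger(Trigger.Once())                  // drain what is available, then stop
  .start()

query.awaitTermination()

The query starts from the given offsets, processes whatever is in the topics as a single micro-batch, and terminates, which is essentially the quitWhenDone behavior requested above.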