ansrivas / spark-structured-streaming

Spark structured streaming with Kafka data source and writing to Cassandra

spark-structured-streaming with Avro kafka messages

amoussoubaruch opened this issue · comments

Hello,

I want to use spark-structured-streaming to process data fetched from Kafka messages and then store it as rows in a Cassandra database.
I need one clarification.
The messages in Kafka are serialized in Avro format.
How can I deserialize the messages to JSON for processing before storing them into Cassandra?
Is this possible with Spark 2.1.0 or 2.1.1?

Any advice or help would be appreciated.
Thanks in advance.

Sure, this can be done. If you have the schema, you can read it first:

import scala.io.Source
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.specific.SpecificDatumReader

@transient lazy val schemaString =
  Source.fromURL(getClass.getResource("/message.avsc")).mkString
// Initialize schema
@transient lazy val schema: Schema = new Schema.Parser().parse(schemaString)
@transient lazy val reader = new SpecificDatumReader[GenericRecord](schema)

Now create a deserializer:

import java.io.ByteArrayInputStream
import org.apache.avro.io.DecoderFactory

def deserializeMessage(msg: Array[Byte]): GenericRecord = {
  try {
    val in = new ByteArrayInputStream(msg)
    val decoder = DecoderFactory.get.directBinaryDecoder(in, null)
    reader.read(null, decoder)
  } catch {
    case e: Exception => null // return null for messages that fail to decode
  }
}

And finally:

val outDF = df.map { msg =>
  val dmsg = deserializeMessage(msg)
  val yourVal = dmsg.get("your_key_from_avro")
  // ... build whatever output you need from the extracted fields
}

I think this should work.
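For reference, here is a minimal sketch of wiring this into the Kafka source, assuming spark.implicits._ is in scope, df is the DataFrame returned by the Kafka readStream source, and "your_key_from_avro" stands in for a real field from your schema:

import spark.implicits._

// Kafka exposes the payload in the binary `value` column; pull it out as raw bytes first.
val decoded = df
  .select($"value").as[Array[Byte]]
  .map { bytes =>
    val record = deserializeMessage(bytes)
    if (record != null) record.get("your_key_from_avro").toString else null
  }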

@ansrivas thanks for your answer!

Kafka returns me a dataframe with these fields:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- timestampType: integer (nullable = true)

And the data is in value.
I followed your steps to decode the Avro message, but value is not of type Array[Byte].
How can I get the message from value?

Thanks in advance

I tried to use a UDF function like this:

spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) =>
MyDeserializerWrapper.deser.deserialize(topic, bytes)
)

from the Databricks documentation; here is the link: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html

But I can't figure out how to adapt it to your code!

import org.apache.kafka.common.serialization.ByteArrayDeserializer

You can use this in place of new MyDeserializer in the link given above; you will get a byte array as output. Then, after deserialization, you will have to write some method to extract the fields individually.
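For example, a rough sketch of such an extraction method; the case class and the field names device_id / played_at are only placeholders for whatever your .avsc defines:

// Placeholder field names; replace them with the fields from your Avro schema.
case class PlayEvent(deviceId: String, playedAt: Long)

def toPlayEvent(record: GenericRecord): PlayEvent =
  PlayEvent(
    deviceId = record.get("device_id").toString,       // Avro strings arrive as Utf8, so convert explicitly
    playedAt = record.get("played_at").asInstanceOf[Long]
  )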

Not sure I understand correctly.
I have this type for value: org.apache.spark.sql.DataFrame = [value: binary].
How can I get the byte array?

When I try, I get this error:

<console>:76: error: missing argument list for method deserializeMessage. Unapplied methods are only converted to functions when a function type is expected. You can make this conversion explicit by writing deserializeMessage _ or deserializeMessage() instead of deserializeMessage. deserializeMessage.deserialize(topic, bytes)

Something like this:

object MyDeserializerWrapper {
  val deser =  new ByteArrayDeserializer 
}
spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) => 
  MyDeserializerWrapper.deser.deserialize(topic, bytes)
)
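Once registered, the UDF can be applied directly in a selectExpr on the streaming DataFrame. A small usage sketch, where kafkaDF stands in for the DataFrame returned by spark.readStream.format("kafka")...load() and the topic name is just an example:

// Apply the registered UDF to the Kafka value column.
val deserialized = kafkaDF.selectExpr("""deserialize("my_topic", value) AS message""")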

If possible, post the code of everything you have tried.

This is my starting code:

val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokers1.vi.com:9092,brokers2.vi.com:9092")
  .option("subscribe", "prv_ProofOfPlay_int")
  .option("startingOffsets", "latest")
  .load()

def deserializeMessage(msg: Array[Byte]): GenericRecord = {
  try {
    val in = new ByteArrayInputStream(msg)
    val decoder = DecoderFactory.get.directBinaryDecoder(in, null)
    reader.read(null, decoder)
  } catch {
    case e: Exception => null
  }
}

spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) =>
  deserializeMessage.deserialize(topic, bytes)
)

In place of this:

spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) =>
deserializeMessage.deserialize(topic, bytes)
)

try the snippet I posted above.

It doesn't work.

Lastly, you can try the snippets in this comment.

// Read the Avro schema file
@transient lazy val schemaString =
  Source.fromURL(getClass.getResource("/message.avsc")).mkString
// Initialize schema
@transient lazy val schema: Schema = new Schema.Parser().parse(schemaString)
@transient lazy val reader = new SpecificDatumReader[GenericRecord](schema)

Now create a deserializer:

def deserializeMessage(msg: Array[Byte]): GenericRecord = {
  try {
    val in = new ByteArrayInputStream(msg)
    val decoder = DecoderFactory.get.directBinaryDecoder(in, null)
    reader.read(null, decoder)
  } catch {
    case e: Exception => null
  }
}

Register your deserializer as a Spark UDF:

object MyDeserializerWrapper {
  val deser =  new ByteArrayDeserializer 
}
spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) => 
  MyDeserializerWrapper.deser.deserialize(topic, bytes)
)

And then select the "deserialized-value" column that you get after applying the above UDF:

val outDF = df.select($"my_deserialized_column").as[Array[Byte]].map { msg =>
  val dmsg = deserializeMessage(msg)
  val yourVal = dmsg.get("your_key_from_avro")
  // ... build your output from the extracted fields
}
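
To sanity-check the decoded stream, you could first write outDF to the console sink; a throwaway sketch (not the Cassandra sink itself):

// Print decoded records to stdout to verify that deserialization works end to end.
val query = outDF.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()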

And if it still doesn't work, try posting your errors and the snippets you tried on Stack Overflow. For your particular scenario, people there might share more insights and it can be resolved quickly.

I have this error:

<console>:71: error: not found: value dmsg val outDF = ds1.select($"value").map{ msg => dmsg = deserializeMessage(msg)

And when I delete dmsg I get this:

<console>:72: error: type mismatch; found : org.apache.spark.sql.Row required: Array[Byte] deserializeMessage(msg)

I think the serializer used to encode the messages is not Kafka's default serializer but the Confluent kafka-avro-serializer.
Is it possible that this is why the deserialization doesn't work? Because if I use this

object MyDeserializerWrapper {
  val deser = new ByteArrayDeserializer
}
spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) =>
  MyDeserializerWrapper.deser.deserialize(topic, bytes)
)

val ds2 = ds1.selectExpr("""deserialize("mytopic", value) AS message""")
It doesn't give any error, but I always get the binary type.

ds2: org.apache.spark.sql.DataFrame = [message: binary]

Hi @amoussoubaruch,
I had some free time and implemented a basic example for your use case.
The Avro example branch contains the Avro-deserialization related code.
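
One more note on the Confluent suspicion above: if the messages were produced with Confluent's kafka-avro-serializer, the plain directBinaryDecoder will indeed fail, because each payload is prefixed with a 1-byte magic byte and a 4-byte schema id before the Avro body. A rough sketch of skipping that 5-byte header before decoding, reusing the reader defined earlier and assuming your local message.avsc matches the writer schema:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.io.DecoderFactory

def deserializeConfluentMessage(msg: Array[Byte]): GenericRecord = {
  // Confluent wire format: [magic byte 0x00][4-byte schema id][Avro binary payload]
  val payload = java.util.Arrays.copyOfRange(msg, 5, msg.length)
  val decoder = DecoderFactory.get.binaryDecoder(payload, null)
  reader.read(null, decoder)
}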

Hi @ansrivas,

Sorry for being late on this. I didn't see your last comment. Thanks for that.
I will test your implementation next week and let you know.
To complete my use case, I used a classic DStream and wrote a class for deserialization. For the next update, I will try to use Structured Streaming instead.

Thanks