ansrivas / spark-structured-streaming

Spark structured streaming with Kafka data source and writing to Cassandra

spark-structured-streaming with Avro kafka messages

amoussoubaruch opened this issue · comments

Hello,

I want to use spark-structured-streaming to process data fetched from Kafka messages and then store it as rows in a Cassandra database.
I need one clarification.
The messages in Kafka are serialized in Avro format.
How can I deserialize the messages to JSON for processing before storing them into Cassandra?
Is this possible with Spark 2.1.0 or 2.1.1?

Any advice or help would be appreciated.
Thanks in advance.

Sure, this can be done. If you have the schema, you can read it first:

import scala.io.Source
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.specific.SpecificDatumReader

@transient lazy val schemaString =
  Source.fromURL(getClass.getResource("/message.avsc")).mkString
// Initialize schema
@transient lazy val schema: Schema = new Schema.Parser().parse(schemaString)
@transient lazy val reader = new SpecificDatumReader[GenericRecord](schema)

Now create a deserializer:

import java.io.ByteArrayInputStream
import org.apache.avro.io.DecoderFactory

def deserializeMessage(msg: Array[Byte]): GenericRecord = {
  try {
    val in = new ByteArrayInputStream(msg)
    val decoder = DecoderFactory.get.directBinaryDecoder(in, null)
    reader.read(null, decoder)
  } catch {
    case e: Exception => null // return null for messages that fail to decode
  }
}

And finally:

val outDF = df.map { msg =>
  val dmsg = deserializeMessage(msg)
  val yourVal = dmsg.get("your_key_from_avro")
  // ... build whatever output you need from the extracted fields
}

I think this should work.
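For reference, here is a minimal sketch of wiring this into the Kafka source, assuming spark.implicits._ is in scope, df is the DataFrame returned by the Kafka readStream source, and "your_key_from_avro" stands in for a real field from your schema:

import spark.implicits._

// Kafka exposes the payload in the binary `value` column; pull it out as raw bytes first.
val decoded = df
  .select($"value").as[Array[Byte]]
  .map { bytes =>
    val record = deserializeMessage(bytes)
    if (record != null) record.get("your_key_from_avro").toString else null
  }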

@ansrivas thanks for your answer!

Kafka returns me a dataframe with these fields:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- timestampType: integer (nullable = true)

And the data is in value.
I followed your steps to decode the Avro message, but value is not of type Array[Byte].
How can I get the message from value?

Thanks in advance

I tried to use a UDF function like this:

spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) =>
MyDeserializerWrapper.deser.deserialize(topic, bytes)
)

from the Databricks documentation; here is the link: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html

But I can't figure out how to adapt it to your code!

import org.apache.kafka.common.serialization.ByteArrayDeserializer

You can use this in place of new MyDeserializer in the link given above; you will get a byte array as output. Then, after deserialization, you will have to write some method to extract the fields individually.
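For example, a rough sketch of such an extraction method; the case class and the field names device_id / played_at are only placeholders for whatever your .avsc defines:

// Placeholder field names; replace them with the fields from your Avro schema.
case class PlayEvent(deviceId: String, playedAt: Long)

def toPlayEvent(record: GenericRecord): PlayEvent =
  PlayEvent(
    deviceId = record.get("device_id").toString,       // Avro strings arrive as Utf8, so convert explicitly
    playedAt = record.get("played_at").asInstanceOf[Long]
  )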

Not sure I understand correctly.
I have this type for value: org.apache.spark.sql.DataFrame = [value: binary].
How can I get the byte array?

When I try, I get this error:

<console>:76: error: missing argument list for method deserializeMessage. Unapplied methods are only converted to functions when a function type is expected. You can make this conversion explicit by writing deserializeMessage _ or deserializeMessage() instead of deserializeMessage. deserializeMessage.deserialize(topic, bytes)

Something like this:

object MyDeserializerWrapper {
  val deser =  new ByteArrayDeserializer 
}
spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) => 
  MyDeserializerWrapper.deser.deserialize(topic, bytes)
)
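Once registered, the UDF can be applied directly in a selectExpr on the streaming DataFrame. A small usage sketch, where kafkaDF stands in for the DataFrame returned by spark.readStream.format("kafka")...load() and the topic name is just an example:

// Apply the registered UDF to the Kafka value column.
val deserialized = kafkaDF.selectExpr("""deserialize("my_topic", value) AS message""")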

If possible, post the code of everything you have tried.

This is my starting code:

val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokers1.vi.com:9092,brokers2.vi.com:9092")
  .option("subscribe", "prv_ProofOfPlay_int")
  .option("startingOffsets", "latest")
  .load()

def deserializeMessage(msg: Array[Byte]): GenericRecord = {
  try {
    val in = new ByteArrayInputStream(msg)
    val decoder = DecoderFactory.get.directBinaryDecoder(in, null)
    reader.read(null, decoder)
  } catch {
    case e: Exception => null
  }
}

spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) =>
  deserializeMessage.deserialize(topic, bytes)
)

In place of this:

spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) =>
deserializeMessage.deserialize(topic, bytes)
)

try the snippet I posted above.

It doesn't work.

Lastly, you can try the snippets in this comment.

// Read the Avro schema file
@transient lazy val schemaString =
  Source.fromURL(getClass.getResource("/message.avsc")).mkString
// Initialize schema
@transient lazy val schema: Schema = new Schema.Parser().parse(schemaString)
@transient lazy val reader = new SpecificDatumReader[GenericRecord](schema)

Now create a deserializer:

def deserializeMessage(msg: Array[Byte]): GenericRecord = {
  try {
    val in = new ByteArrayInputStream(msg)
    val decoder = DecoderFactory.get.directBinaryDecoder(in, null)
    reader.read(null, decoder)
  } catch {
    case e: Exception => null
  }
}

Register your deserializer as a Spark UDF:

object MyDeserializerWrapper {
  val deser =  new ByteArrayDeserializer 
}
spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) => 
  MyDeserializerWrapper.deser.deserialize(topic, bytes)
)

And then select the "deserialized-value" column that you get after applying the above UDF:

val outDF = df.select($"my_deserialized_column").as[Array[Byte]].map { msg =>
  val dmsg = deserializeMessage(msg)
  val yourVal = dmsg.get("your_key_from_avro")
  // ... build your output from the extracted fields
}
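
To sanity-check the decoded stream, you could first write outDF to the console sink; a throwaway sketch (not the Cassandra sink itself):

// Print decoded records to stdout to verify that deserialization works end to end.
val query = outDF.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()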

And if it still doesn't work, try posting your errors and the snippets you tried on Stack Overflow. For your particular scenario, people there might share more insights and it can be resolved quickly.

I have this error:

<console>:71: error: not found: value dmsg val outDF = ds1.select($"value").map{ msg => dmsg = deserializeMessage(msg)

And when I delete dmsg I get this:

<console>:72: error: type mismatch; found : org.apache.spark.sql.Row required: Array[Byte] deserializeMessage(msg)

I think the serializer used to encode the messages is not Kafka's default serializer but the Confluent kafka-avro-serializer.
Is it possible that this is why the deserialization doesn't work? Because if I use this

object MyDeserializerWrapper {
  val deser = new ByteArrayDeserializer
}
spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) =>
  MyDeserializerWrapper.deser.deserialize(topic, bytes)
)

val ds2 = ds1.selectExpr("""deserialize("mytopic", value) AS message""")
It doesn't give any error, but I always get the binary type.

ds2: org.apache.spark.sql.DataFrame = [message: binary]

Hi @amoussoubaruch,
I had some free time and implemented a basic example for your use case.
The Avro example branch contains the Avro-deserialization related code.
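
One more note on the Confluent suspicion above: if the messages were produced with Confluent's kafka-avro-serializer, the plain directBinaryDecoder will indeed fail, because each payload is prefixed with a 1-byte magic byte and a 4-byte schema id before the Avro body. A rough sketch of skipping that 5-byte header before decoding, reusing the reader defined earlier and assuming your local message.avsc matches the writer schema:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.io.DecoderFactory

def deserializeConfluentMessage(msg: Array[Byte]): GenericRecord = {
  // Confluent wire format: [magic byte 0x00][4-byte schema id][Avro binary payload]
  val payload = java.util.Arrays.copyOfRange(msg, 5, msg.length)
  val decoder = DecoderFactory.get.binaryDecoder(payload, null)
  reader.read(null, decoder)
}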

Hi @ansrivas,

Sorry for being late on this. I didn't see your last comment. Thanks for that.
I will test your implementation next week and let you know.
To complete my use case, I used a classic DStream and wrote a class for deserialization. For the next update, I will try to use Structured Streaming instead.

Thanks