Kotlin / kotlin-spark-api

This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this become part of Apache Spark 3.x.

generateEncoder() fails for data class with ByteArray field

mlin opened this issue

I have a data class containing a ByteArray blob field. When I try to work with a Dataset of these, I get the following (kotlin-spark-api v1.0.2, Spark v3.1.2):

Exception in thread "main" java.lang.ClassCastException: class org.apache.spark.sql.types.BinaryType$ cannot be cast to class org.apache.spark.sql.types.ObjectType (org.apache.spark.sql.types.BinaryType$ and org.apache.spark.sql.types.ObjectType are in unnamed module of loader 'app')
        at org.apache.spark.sql.KotlinReflection$.toCatalystArray$1(KotlinReflection.scala:609)
        at org.apache.spark.sql.KotlinReflection$.$anonfun$serializerFor$1(KotlinReflection.scala:788)
        at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
        at org.apache.spark.sql.KotlinReflection.cleanUpReflectionObjects(KotlinReflection.scala:1012)
        at org.apache.spark.sql.KotlinReflection.cleanUpReflectionObjects$(KotlinReflection.scala:1011)
        at org.apache.spark.sql.KotlinReflection$.cleanUpReflectionObjects(KotlinReflection.scala:47)
        at org.apache.spark.sql.KotlinReflection$.serializerFor(KotlinReflection.scala:591)
        at org.apache.spark.sql.KotlinReflection$.$anonfun$serializerFor$16(KotlinReflection.scala:761)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.map(TraversableLike.scala:238)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
        at org.apache.spark.sql.KotlinReflection$.$anonfun$serializerFor$1(KotlinReflection.scala:748)
        at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
        at org.apache.spark.sql.KotlinReflection.cleanUpReflectionObjects(KotlinReflection.scala:1012)
        at org.apache.spark.sql.KotlinReflection.cleanUpReflectionObjects$(KotlinReflection.scala:1011)
        at org.apache.spark.sql.KotlinReflection$.cleanUpReflectionObjects(KotlinReflection.scala:47)
        at org.apache.spark.sql.KotlinReflection$.serializerFor(KotlinReflection.scala:591)
        at org.apache.spark.sql.KotlinReflection$.serializerFor(KotlinReflection.scala:578)
        at org.apache.spark.sql.KotlinReflection.serializerFor(KotlinReflection.scala)
        at org.jetbrains.kotlinx.spark.api.ApiV1Kt.kotlinClassEncoder(ApiV1.kt:180)
        at org.jetbrains.kotlinx.spark.api.ApiV1Kt.generateEncoder(ApiV1.kt:167)
...

A minimal repro is simply:

import org.jetbrains.kotlinx.spark.api.*

data class BlobTest(val blob: ByteArray) {
    constructor(str: String) : this(str.toByteArray())
}

fun main() {
    withSpark {
        dsOf(BlobTest("foo"), BlobTest("bar"))
    }
}

This seems to be the offending cast:

https://github.com/JetBrains/kotlin-spark-api/blob/70673efbebd56033425f37bb0d63063509dd96c1/core/3.0/src/main/scala/org/apache/spark/sql/KotlinReflection.scala#L610

where input.dataType is BinaryType, which is not an ObjectType (it is a sibling subtype of DataType). I wonder whether any other primitive array type suffers from the same issue?
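For instance, a quick probe along these lines (hypothetical class names, same setup as the repro above) would show whether IntArray or DoubleArray take the same code path:

import org.jetbrains.kotlinx.spark.api.*

// Hypothetical probe: swap the ByteArray for other primitive array types
// to see whether they trip the same cast in toCatalystArray.
data class IntArrayTest(val values: IntArray)
data class DoubleArrayTest(val values: DoubleArray)

fun main() {
    withSpark {
        dsOf(IntArrayTest(intArrayOf(1, 2, 3))).show()
        dsOf(DoubleArrayTest(doubleArrayOf(1.0, 2.0))).show()
    }
}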

We fixed BinaryType support in PR #134. It should be working fine now :)
The next release will contain the fix, but until then you can check whether it works for you on the https://github.com/JetBrains/kotlin-spark-api/tree/spark-3.2 branch.
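If you need something in the meantime, one possible stopgap (an untested sketch, and it stores rows as opaque serialized binary rather than real columns) is to bypass the Kotlin encoder generation entirely with Spark's built-in Kryo encoder:

import org.apache.spark.sql.Encoders
import org.jetbrains.kotlinx.spark.api.*

data class BlobTest(val blob: ByteArray) {
    constructor(str: String) : this(str.toByteArray())
}

fun main() {
    withSpark {
        // Encoders.kryo never goes through KotlinReflection.serializerFor,
        // so the BinaryType -> ObjectType cast is never attempted.
        val ds = spark.createDataset(
            listOf(BlobTest("foo"), BlobTest("bar")),
            Encoders.kryo(BlobTest::class.java)
        )
        ds.show()
    }
}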

Thanks, that's good news! I'll give it a shot when I'm able.

Can I close this?