Kotlin / kotlin-spark-api

This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x.

Extension function forEachBatch can be added for DataStreamWriter

pihanya opened this issue

I found it difficult to call DataStreamWriter.foreachBatch from Kotlin because the code won't compile until an explicit VoidFunction2 is constructed.

So I suggest adding an extension like this for DataStreamWriter:

import org.apache.spark.api.java.function.VoidFunction2
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.DataStreamWriter

public fun <T> DataStreamWriter<T>.forEachBatch(
    func: (batch: Dataset<T>, batchId: Long) -> Unit
): DataStreamWriter<T> = foreachBatch(
    VoidFunction2 { batch, batchId ->
        func(batch, batchId)
    }
)
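
With such an extension in scope, the call site becomes a plain Kotlin lambda. A minimal usage sketch, assuming `events` is a streaming Dataset and the logging body is just an example (the function and names here are hypothetical, not part of the API):

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical usage of the suggested extension: no explicit VoidFunction2 needed.
fun <T> startBatchLogging(events: Dataset<T>): StreamingQuery =
    events.writeStream()
        .forEachBatch { batch, batchId ->
            println("batch $batchId contains ${batch.count()} rows")
        }
        .start()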

Good one! Will add it along with other streaming functions in the next update. Are there any more functions you found that would benefit from having these kinds of extensions?

Jolan, thank you for the great work on kotlin-spark-api!

The only reason I have started opening issues is that I saw you have revived the enhancement of kotlin-spark-api.

I had to fork kotlin-spark-api to monkey-patch its encoders. See details in the spoiler below.

Monkey patch description

Original code:
Encoding.kt#L131-L147.

Monkey patch:

public fun <T> generateEncoder(type: KType, cls: KClass<*>): Encoder<T> {
    @Suppress("UNCHECKED_CAST")
    return when {
        // Use the generated Kotlin class encoder only when no predefined encoder exists for cls;
        // otherwise fall back to the predefined encoder (or a bean encoder as a last resort).
        (cls !in ENCODERS) && isSupportedClass(cls) -> kotlinClassEncoder(memoizedSchema(type), cls)
        else -> ENCODERS[cls] as? Encoder<T>? ?: Encoders.bean(cls.java)
    } as Encoder<T>
}

private fun isSupportedClass(cls: KClass<*>): Boolean = cls.isData ||
    cls.isSubclassOf(Map::class) ||
    cls.isSubclassOf(Iterable::class) ||
    cls.isSubclassOf(Product::class) ||
    cls.java.isArray

Sorry, but I don't remember why I did this monkey patch.
I had to build a prototype quickly and there was not much time to think it through.


Apart from the monkey patch, I don't think I did anything special in my fork of kotlin-spark-api.
I also structurally refactored the library, and after the refactoring the file structure turned out to be pretty much the same as in this repository.
If there turns out to be something worth sharing, be sure that I will create a new issue or make a pull request myself.

The problems I ran into when using `kotlin-spark-api`:
  1. Spark code generation fails for data classes with enum fields;
  2. It is impossible to create an encoder for data classes that have a circular dependency on each other (DataClass1 has a field of type DataClass2, and DataClass2 has a field of type DataClass1; see the sketch below).
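
For illustration, a minimal pair of classes that reproduces problem 2; the names DataClass1 and DataClass2 just follow the description above, and the sketch is an assumption rather than code from my fork:

// Hypothetical minimal reproduction of the circular-dependency problem (2).
data class DataClass1(val id: Long, val other: DataClass2? = null)
data class DataClass2(val id: Long, val other: DataClass1? = null)

// Deriving a Spark encoder for either class cannot work: the schema of DataClass1
// needs the schema of DataClass2 and vice versa, so schema generation never terminates.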

When the time comes, if the problems are still relevant, I will probably create issues with detailed descriptions and test them against the latest version of kotlin-spark-api.

P.S. I don't have a fix for the mentioned problems in my fork.

As for the monkey patch, it looks like you found a case where there is an encoder in ENCODERS, but isSupportedByKotlinClassEncoder returns true, so it isn't used. It would be nice to know for which type this is the case, though, haha.

Enum support (also in data classes) was added here #99.

I just tested, and circular dependencies also don't work in Scala: Exception in thread "main" java.lang.UnsupportedOperationException: cannot have circular references in class, but got the circular reference of class org.jetbrains.kotlinx.spark.examples.DataClass1
Is there any specific use case where a circular reference is needed?

Would be nice to know for which type this is the case

I guess we will start migrating our product to JetBrains/kotlin-spark-api in a few months. I will let you know if that corner case turns up. Sorry that I am not able to share it now.

Is there any specific use case where a circular reference is needed?

Our solution integrates with another application via Kafka and persists all the changes to a data warehouse. From a single Kafka topic we get updates of tree structures that resemble folder structures in a file system (see the spoiler below), where each update is a part of the tree.
For now, we are forced to call ObjectMapper#readValue on raw strings inside a map { ... } block, which is neither convenient nor readable in source code.

Example tree structure with circular reference (spoiler)
import com.fasterxml.jackson.annotation.JsonCreator
import com.fasterxml.jackson.annotation.JsonSubTypes
import com.fasterxml.jackson.annotation.JsonTypeInfo

@JsonTypeInfo(use = JsonTypeInfo.Id.DEDUCTION)
@JsonSubTypes(
    JsonSubTypes.Type(value = CatalogueNode.Folder::class, name = "Folder"),
    JsonSubTypes.Type(value = CatalogueNode.IdentityRef::class, name = "IdentityRef"),
)
sealed interface CatalogueNode {

    val id: Long

    val uid: String

    val seq: Long
    
    data class Folder @JsonCreator constructor(

        override val id: Long,

        override val uid: String,

        override val seq: Long,
        
        val children: List<CatalogueNode>? = null,

        val name: String,

        val description: String? = null,
    ) : CatalogueNode

    data class IdentityRef @JsonCreator constructor(

        override val id: Long,

        override val uid: String,

        override val seq: Long,
        
        // Some other NDA fields
    ) : CatalogueNode
}
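
For reference, a rough sketch of the map { ... } workaround mentioned above; the dataset name and the flattening step are assumptions, since the parsed CatalogueNode tree itself cannot be encoded by Spark:

import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import org.apache.spark.sql.Dataset
import org.jetbrains.kotlinx.spark.api.*

// rawJson is assumed to be a Dataset<String> of serialized CatalogueNode updates read from Kafka.
fun folderNames(rawJson: Dataset<String>): Dataset<String> {
    val mapper = jacksonObjectMapper()                 // Jackson, not Spark, does the decoding
    return rawJson.map { json ->
        val node = mapper.readValue(json, CatalogueNode::class.java)
        // The parsed tree has to be reduced to an encodable shape before leaving the lambda.
        (node as? CatalogueNode.Folder)?.name ?: ""
    }
}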

By the way, the ability to have java.util.UUID fields in data classes would also be very helpful, as all our data comes with UUIDs in it. For now we need to write a lot of boilerplate code that uses UUID#fromString to instantiate UUID instances from Strings.

Your specific example doesn't work because List<CatalogueNode> cannot be encoded. It's an interface, which can have functions, values, etc., so Spark does not know how to encode that. Only a collection of actual data classes would be allowed. So something like:

val folderChildren: List<Folder>? = null,
val identityRefChildren: List<IdentityRef>? = null,

But then again, the circular reference appears, of course.
Unfortunately, I don't think we have a solution for that, especially since Spark itself does not support circular references. It makes sense if you consider that Datasets are essentially column/row data structures; if circular references were allowed, an infinite recursion could exist within a cell, which cannot be saved.
Some things I found regarding this: https://issues.apache.org/jira/browse/SPARK-33598

As for java.util.UUID, Spark does not support this either, so I think it's outside the scope of the Kotlin Spark API to add support for this specific class. Usually we only mirror org.apache.spark.sql.Encoders.

It should be mentioned that Spark has its own uuid() SQL function. Otherwise, it could work to just call UUID#toString.
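
A minimal sketch of that approach, keeping the UUID as a String column and converting only at the application boundary (the Event class and helper functions are hypothetical, not part of the API):

import java.util.UUID

// The UUID is stored as a String so Spark can encode the data class.
data class Event(val id: String, val payload: String)

fun newEvent(id: UUID, payload: String): Event = Event(id.toString(), payload)  // UUID#toString on the way in
fun Event.uuid(): UUID = UUID.fromString(id)                                    // convert back only when needed

// For generating ids inside Spark itself, the built-in SQL uuid() function can be used,
// e.g. via org.apache.spark.sql.functions.expr("uuid()").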

Hi. I'm happy to see this has been added and released.

It does, however, have a name case discrepancy with the built-in function: the original Spark function is foreachBatch, while the one in this repository is forEachBatch. I'm not sure whether this is intentional, but if it's not, it is perhaps better to fix it sooner rather than later.

@hawkaa, hello! Thank you for the remark.

In Kotlin we follow the coding conventions, which state: "Names of functions, properties and local variables start with a lowercase letter and use camel case". This is the first argument for why forEachBatch is the better option over foreachBatch.

Also, in kotlin-spark-api we have a forEach function that calls Spark's foreach under the hood. So forEachBatch is more idiomatic (in terms of kotlin-spark-api) compared to Spark's foreachBatch.

Therefore, when choosing between forEachBatch and foreachBatch for kotlin-spark-api, it is more appropriate to use the first variant, which is available in release v1.1.0.

Plus, as a bonus, there is no function overload issue now :) We do have those for reduce {} -> reduceK {}, unfortunately.