milessabin / shapeless

Generic programming for Scala

Tagged types based on anything other than AnyVals produce an exception in Spark

DCameronMauch opened this issue

A good explanation of the issue, with examples, is here:
https://stackoverflow.com/questions/66377920/how-fix-issues-with-spark-and-shapeless-tagged-type-based-on-string
There have been no responses.
I also posted the question in the Shapeless Gitter channel, with no response.

Basically, any case class that uses a tagged type like type Foo = Int @@ FooTag works just fine with Spark Datasets.

But if I use type Foo = String @@ FooTag, it fails with the exception java.lang.ClassNotFoundException: no Java class corresponding to <refinement of java.lang.String with shapeless.tag.Tagged[FooTag]> found.
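
For reference, here is a minimal self-contained sketch of the pattern (object and tag names are illustrative, not from my real code):

import org.apache.spark.sql.SparkSession
import shapeless.tag
import shapeless.tag.@@

object TaggedRepro {
  trait FooTag
  type IntFoo    = Int @@ FooTag    // encoder derivation works
  type StringFoo = String @@ FooTag // encoder derivation throws

  final case class IntRecord(id: Int, foo: IntFoo)
  final case class StringRecord(id: Int, foo: StringFoo)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("tagged-repro").getOrCreate()
    import spark.implicits._

    // Works: Int @@ FooTag is a subtype of Int, so it is encoded as a plain integer column.
    Seq(IntRecord(1, tag[FooTag](1))).toDS().show()

    // Fails on Spark 2.4 with the ClassNotFoundException above, at encoder derivation time.
    Seq(StringRecord(1, tag[FooTag]("monday"))).toDS().show()
  }
}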

Not sure if this is a bug in Shapeless or a limitation of Spark. Is there any kind of workaround? Or am I limited to AnyVal types like Int, Long, Double, and Boolean as the base type?

I created a custom Spark UDT for java.util.UUID, and it works great. But when I use tagging on the UUID, I hit the same issue.

Thank you! Any guidance would be greatly appreciated.

Oh, if it makes any difference, I'm using the following:
Scala 2.11.12
Spark 2.4.7
Shapeless 2.3.3

Hello? Anyone there?

What happens if you replace the tagged type with the underlying type alias manually?
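
Concretely, something like this (hypothetical names, untested against your code):

object Workaround {
  trait FooTag

  // The tagged alias produces the refinement String with shapeless.tag.Tagged[FooTag],
  // which Spark 2.4's reflection-based encoder cannot map to a Java class:
  // type Foo = String @@ FooTag

  // Replacing it with the underlying type keeps the alias name but erases to
  // java.lang.String, so the stock Spark String encoder applies:
  type Foo = String

  final case class Record(id: Int, foo: Foo)
}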

You might have better luck with Scala 2.12 too.

@DCameronMauch could you at least give the full stack trace or a standalone reproduction we could run?
Notebooks import a lot of stuff magically, and I don't know what they pull in.
I would bet on the Encoder trying to do reflection-based things and failing.

We will be upgrading to Spark 3.x in the upcoming months, along with Scala 2.12. So I can try that then. I'll see if I can put together an online example.

Here is a repo that demonstrates the issue:
https://github.com/DCameronMauch/TaggedType

If you change the method called by main to the Int version, you can see that it works just fine.
Using Long as the base type also works.

Here is the stack trace:

Exception in thread "main" java.lang.ClassNotFoundException: no Java class corresponding to <refinement of String with shapeless.tag.Tagged[example.DayOfWeekAsString.DayOfWeekTag]> found
	at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.scala$reflect$runtime$JavaMirrors$JavaMirror$$anonfun$$noClass$1(JavaMirrors.scala:1204)
	at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1242)
	at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1203)
	at scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:49)
	at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
	at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
	at scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:44)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1203)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
	at org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:726)
	at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:107)
	at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:88)
	at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:929)
	at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
	at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:87)
	at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1$$anonfun$8.apply(ScalaReflection.scala:658)
	at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1$$anonfun$8.apply(ScalaReflection.scala:651)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.immutable.List.flatMap(List.scala:355)
	at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:651)
	at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:471)
	at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:929)
	at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
	at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:471)
	at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:460)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
	at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
	at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248)
	at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34)
	at example.Application$.tryDayOfWeekAsString(Application.scala:33)
	at example.Application$.main(Application.scala:6)
	at example.Application.main(Application.scala)

When I run it against the Int version of DayOfWeek, I get this expected output:

+---+---------+
|id |dayOfWeek|
+---+---------+
|1  |1        |
|2  |3        |
|3  |5        |
+---+---------+

New piece of information: I created a "spark3" branch on that repo, using Spark 3.1 and Scala 2.12. In this environment, the string-based tagged type works as expected. So it's a Spark 2.4 and/or Scala 2.11 thing. I feel like it's still worth investigating, because a lot of people are stuck on Spark 2.2 or 2.4 with Scala 2.11. We hope to upgrade to the latest in the next few months, so at least the issue will be resolved for us.

Tried a few more combinations. The issue appears to be with Spark 2.4: even with Scala 2.12, it still fails. But as soon as I upgrade to Spark 3.0, it starts working. I couldn't test Spark 3.0 with Scala 2.11, as that combination isn't supported.

The issue is with Spark SQL - it doesn't work with refinement types (which is what @@ translates to). Here is the offending code:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L88-L112
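
The failure can be reproduced without Spark at all, since it is really scala-reflect's runtime-class lookup that chokes on the refinement. A small sketch (the tag name is illustrative; needs scala-reflect and shapeless on the classpath):

import scala.reflect.runtime.universe._
import shapeless.tag.@@

object RefinementDemo extends App {
  trait FooTag

  // Int @@ FooTag is a subtype of Int, so Spark's primitive special case matches
  // before any runtime-class lookup happens:
  println(typeOf[Int @@ FooTag] <:< typeOf[Int]) // true

  // String @@ FooTag is a refinement with no corresponding Java class, so the
  // fallback lookup throws the same ClassNotFoundException as in the stack trace:
  val mirror = runtimeMirror(getClass.getClassLoader)
  println(mirror.runtimeClass(typeOf[String @@ FooTag]))
}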

It works for primitives because they are special-cased in dataTypeFor with isSubtype, so subtypes of primitives are basically treated as primitives. Anything else falls through to a runtime-class lookup, which is what blows up on the refinement.

Unfortunately, Spark is not very good at offering extension points, and I don't think you can define a custom DataType for @@. But if you consider using https://github.com/typelevel/frameless, it does let you define custom encoders: http://typelevel.org/frameless/Injection.html
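
For illustration, a sketch of what that could look like for the string-based tag here (hypothetical names; assumes a frameless version matching your Spark, and note that frameless wants an implicit SparkSession in scope):

import frameless.{Injection, TypedDataset}
import org.apache.spark.sql.SparkSession
import shapeless.tag
import shapeless.tag.@@

object FramelessDemo {
  trait DayOfWeekTag
  type DayOfWeek = String @@ DayOfWeekTag

  // Teach frameless to round-trip the tagged type through a plain String.
  implicit val dayOfWeekInjection: Injection[DayOfWeek, String] =
    Injection(identity, tag[DayOfWeekTag](_))

  final case class Record(id: Int, dayOfWeek: DayOfWeek)

  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession =
      SparkSession.builder().master("local[*]").appName("frameless-demo").getOrCreate()

    // With the Injection in scope, TypedEncoder derivation succeeds for Record.
    TypedDataset.create(Seq(Record(1, tag[DayOfWeekTag]("monday")))).show().run()
  }
}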