linkedin / spark-tfrecord

Read and write TensorFlow TFRecord data from Apache Spark.

Writing TFRecords breaks in Spark 3.2.0

M-Anwar opened this issue

When using the library (com.linkedin.sparktfrecord:spark-tfrecord_2.12:0.3.0) with Spark 3.2.0 to write TFRecords, the job throws an exception:

Caused by: java.lang.AbstractMethodError: Receiver class com.linkedin.spark.datasources.tfrecord.TFRecordOutputWriter does not define or inherit an implementation of the resolved method 'abstract java.lang.String path()' of abstract class org.apache.spark.sql.execution.datasources.OutputWriter.

I think this is caused by SPARK-26164, which changes the OutputWriter class to include an abstract path(): String method (source).

The current TFRecordOutputWriter class doesn't implement this method, hence the error (source).
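
For illustration, here is a minimal sketch of what a fix could look like. It does not use Spark: the OutputWriter below is a simplified stand-in for the real abstract class (which writes InternalRow values, not strings), and SketchOutputWriter and outputPath are hypothetical names, not the library's actual code. It only shows that a subclass has to implement path() once the base class declares it abstract.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for org.apache.spark.sql.execution.datasources.OutputWriter.
// Assumption: String rows are used here only to avoid a Spark dependency.
abstract class OutputWriter {
  def write(row: String): Unit
  def close(): Unit
  // The abstract method named in the AbstractMethodError above.
  def path(): String
}

// Hypothetical writer, not the library's TFRecordOutputWriter.
// It keeps the output path it was constructed with and returns it from path().
class SketchOutputWriter(outputPath: String) extends OutputWriter {
  private val buffer = ArrayBuffer.empty[String]

  // Collect rows in memory; a real writer would serialize them to the file.
  override def write(row: String): Unit = buffer.append(row)

  // Drop buffered rows; a real writer would flush and close its stream.
  override def close(): Unit = buffer.clear()

  // Implementing this is what a writer compiled against older Spark lacks.
  override def path(): String = outputPath

  def rowCount: Int = buffer.size
}
```

A writer compiled against an older Spark has no such method in its bytecode, which is why the failure shows up at runtime as an AbstractMethodError and not as a compile error. Rebuilding against Spark 3.2.0 would flag the missing method at compile time.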

Thanks for reporting the issue.
It looks like this is a breaking change on the Spark side.
Do you already have a solution? A contribution would be highly appreciated, as we don't have much bandwidth for this project at the moment.

@M-Anwar can you verify that #37 fixes the issue you raised?
I plan to merge @tangyl's PR.

The solution looks good to me. I verified that it works on Spark 3.2.0. Thanks @tangyl for the PR!