linkedin / spark-tfrecord

Read and write Tensorflow TFRecord data from Apache Spark.

java.lang.UnsupportedOperationException: buildReader is not supported for TFRECORD

ZixinChen0520 opened this issue

Hi @junshi15
I'm hitting an UnsupportedOperationException with Scala 2.12 and Spark 3.

The exception occurs when I try to show the dataframe or convert it to an RDD. It seems that the method buildReader is not implemented.
My dependency:

            <dependency>
                <groupId>com.linkedin.sparktfrecord</groupId>
                <artifactId>spark-tfrecord_2.12</artifactId>
                <version>0.2.3</version>
            </dependency>

The way I load my TFRecord:

    sparkSession
      .read
      .format("tfrecord")
      .options(config.options)
      .option("recordType", "Example")
      .load(myPath)
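
Either action triggers it, for example (a sketch, with df being the dataframe returned by the load above):

    // df is the dataframe returned by the load above
    df.show()       // throws the exception below
    df.rdd.count()  // converting to an RDD fails the same way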

Here is the exception:

java.lang.UnsupportedOperationException: buildReader is not supported for TFRECORD
	at org.apache.spark.sql.execution.datasources.FileFormat.buildReader(FileFormat.scala:116)
	at org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:137)
	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:478)
	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:468)
	at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:553)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:387)
	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3449)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3617)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:106)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:166)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3615)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3446)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Not sure if I'm using the load method incorrectly.

Thanks!

Do you see the same problem in Spark 2.3 or 2.4?
I tried the following:

  • Launch spark-shell (Spark 3.0.0)

bin/spark-shell --packages com.linkedin.sparktfrecord:spark-tfrecord_2.12:0.2.3

It worked for me.
One difference is that I did not use .options(config.options). As a test, could you remove those options and try again?
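
For example, a minimal read without them might look like this (a sketch; the file path is hypothetical):

    val df = spark.read
      .format("tfrecord")
      .option("recordType", "Example")
      .load("/tmp/sample.tfrecord")  // hypothetical path
    df.show()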

BTW, buildReader is supported here.

I am wondering if your program actually loaded spark-tfrecord correctly.
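
If it helps, here is one way to check from spark-shell which implementation Spark will pick for the tfrecord format (a sketch using Spark's public DataSourceRegister SPI):

    import java.util.ServiceLoader
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.sources.DataSourceRegister

    // List every data source implementation that registers the "tfrecord"
    // short name. If spark-tfrecord loaded correctly, its FileFormat class
    // should be printed; a class from some other jar claiming the same name
    // could explain why the default FileFormat.buildReader (which throws this
    // UnsupportedOperationException, per the stack trace above) is being hit.
    ServiceLoader.load(classOf[DataSourceRegister]).asScala
      .filter(_.shortName.equalsIgnoreCase("tfrecord"))
      .foreach(r => println(r.getClass.getName))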

I compared your buildReader with the buildReader in our internal Spark build. It looks like some variables in our Spark build were changed slightly; I think the problem can be solved by removing those differences.
Thank you so much for your help!

I assume the problem has been resolved. Feel free to reopen it if not.