linkedin / spark-tfrecord

Read and write Tensorflow TFRecord data from Apache Spark.

spark 3.1.1 Caused by: java.lang.ClassNotFoundException: tfrecords.DefaultSource

mullerhai opened this issue · comments

spark 3.1.1

Using Scala version 2.12.10 (Eclipse OpenJ9 VM, Java 11.0.10)
spark-tfrecord 0.3.4
libraryDependencies += "com.linkedin.sparktfrecord" %% "spark-tfrecord" % "0.3.4"
Launch command:
spark-shell --jars /data/spark/jars/spark-tfrecord_2.12-0.3.4.jar

import org.apache.spark.sql.SaveMode
val caseFinalModelFeaturePath ="hdfs:///auth/data/model/salecase_warehouse/case_model_feature_snappy.parquet"
val finalInputDf = spark.read.parquet(caseFinalModelFeaturePath)
val caseFinalTFRecordPath ="file:///data/model/salecase_warehouse/case_model_tfrecord"
finalInputDf.coalesce(10).write.format("tfrecords").option("recordType", "Example")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .mode(SaveMode.Overwrite)
      .save(caseFinalTFRecordPath)

This fails with the following error:

java.lang.ClassNotFoundException: Failed to find data source: tfrecords. Please find packages at http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:743)
  at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:993)
  at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:311)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
  ... 58 elided
Caused by: java.lang.ClassNotFoundException: tfrecords.DefaultSource
  at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
  at java.base/java.lang.ClassLoader.loadClassHelper(ClassLoader.java:1185)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:1100)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:1083)
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:663)
  at org.apache.spark.sql.execution.datasources.DataSource$$$Lambda$7483/0x0000000000000000.apply(Unknown Source)
  at scala.util.Try$.apply(Try.scala:213)
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:663)
  at org.apache.spark.sql.execution.datasources.DataSource$$$Lambda$4336/0x0000000000000000.apply(Unknown Source)
  at scala.util.Failure.orElse(Try.scala:224)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:663)
  ... 62 more

I get the same error with Spark 3.3 as well.

It looks like you did not set the jar properly. Make sure this file is valid: /data/spark/jars/spark-tfrecord_2.12-0.3.4.jar

Or you can try pulling from Maven Central:
spark-shell --packages com.linkedin.sparktfrecord:spark-tfrecord_2.12:0.4.0
This requires access to the Maven Central repository.
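
One way to check from inside spark-shell whether the jar was actually picked up is to list the short names of the data sources visible on the classpath. This is only a sketch: it assumes the connector registers itself through Spark's standard DataSourceRegister service mechanism, and that the REPL thread's context class loader sees the jars passed via --jars or --packages.

```scala
import java.util.ServiceLoader
import org.apache.spark.sql.sources.DataSourceRegister
import scala.collection.JavaConverters._

// Collect the short name of every data source registered on the classpath.
val registered = ServiceLoader
  .load(classOf[DataSourceRegister])
  .asScala
  .map(_.shortName())
  .toSeq
  .sorted

// The name to pass to .format(...) should appear in this list
// if the connector jar was loaded.
registered.foreach(println)
```

If the connector's name is missing from the list, the jar was most likely not loaded. If it is present under a different spelling than the one passed to .format(...), the format string is the problem.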

I found the cause: the format name used for writing and reading TFRecord files has changed.
Old version: .write.format("tfrecords")
New version: .write.format("tfrecord")
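
For reference, a minimal round trip with the singular format name might look like the sketch below. It is meant to be pasted into a spark-shell that was started with the connector on the classpath (for example via the --packages command above). The sample DataFrame and the output path are hypothetical and only there for illustration.

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Hypothetical sample data and output path, for illustration only.
val sampleDf = Seq((1L, 0.5f, "a"), (2L, 1.5f, "b")).toDF("id", "score", "label")
val outputPath = "file:///tmp/tfrecord_format_check"

// Write using the singular format name "tfrecord" (not "tfrecords").
sampleDf.write
  .format("tfrecord")
  .option("recordType", "Example")
  .mode(SaveMode.Overwrite)
  .save(outputPath)

// Read the records back with the same format name and record type.
val readBack = spark.read
  .format("tfrecord")
  .option("recordType", "Example")
  .load(outputPath)

readBack.show()
```

If the write succeeds here but the original job still fails, the remaining difference is likely in how the jar is supplied rather than in the format name.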

glad you figured it out.

My pleasure.