crealytics / spark-excel

A Spark plugin for reading and writing Excel files

[BUG] ClassNotFoundException for 'excel.DefaultSource' while using API V2

RupeshKharche opened this issue

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I am using Spark version 3.5.0 with Scala version 2.13.
I am getting java.lang.ClassNotFoundException: excel.DefaultSource for the following line of code:
Dataset<Row> df = spark.read().format("excel").option("header", "true").load(path);

I have also tried the following code but got a similar error (ClassNotFoundException: com.crealytics.spark.excel.DefaultSource):
Dataset<Row> df = spark.read().format("com.crealytics.spark.excel").option("header", true).load(path);

I have inspected the jar file spark-excel_2.13-3.5.0_0.20.1.jar, but it is missing the package com.crealytics.spark.excel.
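
One quick way to verify what actually made it into the jar (a sketch in Scala; the file name matches the version from this report, and the path is assumed to be the current directory):

import java.util.jar.JarFile
import scala.jdk.CollectionConverters._

// List the entries that belong to the spark-excel package or to the
// ServiceLoader registration file Spark uses to resolve format short names.
val jar = new JarFile("spark-excel_2.13-3.5.0_0.20.1.jar")
jar.entries().asScala
  .map(_.getName)
  .filter(n => n.startsWith("com/crealytics") || n.contains("DataSourceRegister"))
  .foreach(println)
jar.close()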

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Spark version: 3.5.0
- Spark-Excel version: 0.20.1
- OS: Windows 10
- Cluster environment: no cluster
- dev env: Java 17 + Maven

Anything else?

No response

Compare 0.19.0 with 0.20.1:
[screenshot comparing the JAR contents of 0.19.0 and 0.20.1]

Having the same problem after installing com.crealytics:spark-excel_2.12:3.4.1_0.20.1 from Maven in an Azure Databricks cluster with runtime version 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12).

Confirmed working after switching to com.crealytics:spark-excel_2.12:3.4.1_0.19.0.

@nightscape Could you take a look, please?

spark-excel_2.12:3.4.1_0.19.0 YES.
spark-excel_2.12:3.4.1_0.20.1 NO.
spark-excel_2.13:3.5.0_0.20.1 NO.

USING:

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  // .option("useHeader", "true")
  .load("/Users/Leo/unicom/wow_emotes.xlsx")

df.show()

Could somebody look into this? I'll only get around to looking at it in ~1 month because we're in the last stages of house construction and then moving...

Same here. I was finally trying to update our Spark from 3.3 to 3.4 and stumbled over the same issue. It seems to be related to the change from Spark 3.3 to 3.4, and for me it is not related to the actual spark-excel package version (0.19 and up are all failing for me, even if they work for others). Will look into it...

I was wrong with my previous statement. The bug was introduced between 0.19 and 0.20(.1), and the issue is that the DataSourceRegister service file is not packaged into the jar.
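
For context, Spark resolves format("excel") via java.util.ServiceLoader, which reads META-INF/services/org.apache.spark.sql.sources.DataSourceRegister from every jar on the classpath. A minimal check you can run in a Spark shell (a sketch; "excel" should appear in the output when the registration is packaged correctly):

import java.util.ServiceLoader
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Print the short name of every data source registered on the classpath.
ServiceLoader.load(classOf[DataSourceRegister]).asScala
  .foreach(r => println(r.shortName()))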

I get

Py4JJavaError: An error occurred while calling o588.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find data source: com.crealytics.spark.excel. Please find packages at `https://spark.apache.org/third-party-projects.html`.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:870)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:747)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:797)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:337)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:244)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: com.crealytics.spark.excel.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:733)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:733)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:733)
	... 15 more

when trying to do

spark.read.format("com.crealytics.spark.excel")

in PySpark with Spark 3.5.0 and Scala 2.12. I guess it is because of this issue. Is there any update on this? It looks like a packaging error. At the moment the package is unusable at its latest version, which is the only one built for Spark 3.5.x.

Since it is a packaging error, I believe it is a Mill issue. There were some changes in build.sc since 0.19, as well as an update from Mill 0.11.4 to 0.11.5. Unfortunately I am no Mill expert, nor have I gotten the build working in IntelliJ (at least my first tries were pretty unsuccessful). I'll keep trying, but if some Mill expert could help, that would be great.
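
If that hypothesis is right, the relevant knob would be how the assembly treats META-INF/services entries. A hypothetical build.sc fragment to illustrate the mechanism (not the project's actual configuration; module name, Scala version, and the Mill 0.11.x Assembly API are all assumptions):

import mill._, scalalib._

object excel extends ScalaModule {
  def scalaVersion = "2.13.12"

  // Concatenate ServiceLoader registration files from all inputs instead of
  // keeping only the first one, so DataSourceRegister entries survive assembly.
  def assemblyRules = Assembly.defaultRules ++ Seq(
    Assembly.Rule.AppendPattern("META-INF/services/.*")
  )
}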

In the meantime you could try the Spark 3.4.1 / spark-excel 0.19.0 build (3.4.1_0.19.0) with Spark 3.5 and spark.read.format("excel"). It could work because there were no DataSourceV2 API changes from 3.4 to 3.5...
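
For example, with that artifact on the classpath (a sketch of the suggested workaround; the file path is a placeholder):

// started with: spark-shell --packages com.crealytics:spark-excel_2.12:3.4.1_0.19.0
val df = spark.read
  .format("excel")
  .option("header", "true")
  .load("/path/to/file.xlsx")

df.show()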

@christianknoepfle thanks for the advice and the efforts. I temporarily downgraded my cluster to 3.4.1.

The commit introducing the issue seems to be e911d0cf8bd5465f7a3f82289c50045556ba6c91, which is a little surprising because it contains only the minimal changes needed to update Mill.

The incorrect JAR packaging issue should be solved in 0.20.2.
Please test and comment here if it isn't.