crealytics / spark-excel

A Spark plugin for reading and writing Excel files

[BUG] ClassNotFoundException for 'excel.DefaultSource' while using API V2

RupeshKharche opened this issue

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I am using Spark version 3.5.0 with Scala version 2.13.
I am getting java.lang.ClassNotFoundException: excel.DefaultSource for the following line of code:
Dataset<Row> df = spark.read().format("excel").option("header", "true").load(path);

I have also tried the following code but got a similar error (ClassNotFoundException: com.crealytics.spark.excel.DefaultSource):
Dataset<Row> df = spark.read().format("com.crealytics.spark.excel").option("header", true).load(path);

I have inspected the jar file spark-excel_2.13-3.5.0_0.20.1.jar, but it is missing the package com.crealytics.spark.excel.
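
One quick way to verify what actually made it into the jar (a sketch in Scala; the file name matches the version from this report, and the path is assumed to be the current directory):

import java.util.jar.JarFile
import scala.jdk.CollectionConverters._

// List the entries that belong to the spark-excel package or to the
// ServiceLoader registration file Spark uses to resolve format short names.
val jar = new JarFile("spark-excel_2.13-3.5.0_0.20.1.jar")
jar.entries().asScala
  .map(_.getName)
  .filter(n => n.startsWith("com/crealytics") || n.contains("DataSourceRegister"))
  .foreach(println)
jar.close()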

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Spark version: 3.5.0
- Spark-Excel version: 0.20.1
- OS: Windows 10
- Cluster environment: no cluster
- dev env: Java 17 + Maven

Anything else?

No response

Compare 0.19.0 with 0.20.1:
[screenshot comparing the JAR contents of 0.19.0 and 0.20.1]

Having the same problem after installing com.crealytics:spark-excel_2.12:3.4.1_0.20.1 from Maven in an Azure Databricks cluster with runtime version 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12).

Confirmed working after switching to com.crealytics:spark-excel_2.12:3.4.1_0.19.0.

@nightscape Could you take a look, please?

spark-excel_2.12:3.4.1_0.19.0 YES.
spark-excel_2.12:3.4.1_0.20.1 NO.
spark-excel_2.13:3.5.0_0.20.1 NO.

USING:

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  // .option("useHeader", "true")
  .load("/Users/Leo/unicom/wow_emotes.xlsx")

df.show()

Could somebody look into this? I'll only get around to looking at it in ~1 month because we're in the last stages of house construction and then moving...

Same here. I was finally trying to update our Spark from 3.3 to 3.4 and stumbled over the same issue. It seems to be related to the change from Spark 3.3 to 3.4, and for me it is not related to the actual spark-excel package version (0.19 and up are all failing for me, even if they work for others). Will look into it...

I was wrong with my previous statement. The bug was introduced between 0.19 and 0.20(.1), and the issue is that the DataSourceRegister service file is not packaged into the jar.
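
For context, Spark resolves format("excel") via java.util.ServiceLoader, which reads META-INF/services/org.apache.spark.sql.sources.DataSourceRegister from every jar on the classpath. A minimal check you can run in a Spark shell (a sketch; "excel" should appear in the output when the registration is packaged correctly):

import java.util.ServiceLoader
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Print the short name of every data source registered on the classpath.
ServiceLoader.load(classOf[DataSourceRegister]).asScala
  .foreach(r => println(r.shortName()))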

I get

Py4JJavaError: An error occurred while calling o588.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find data source: com.crealytics.spark.excel. Please find packages at `https://spark.apache.org/third-party-projects.html`.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:870)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:747)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:797)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:337)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:244)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: com.crealytics.spark.excel.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:733)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:733)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:733)
	... 15 more

when trying to do

spark.read.format("com.crealytics.spark.excel")

in PySpark with Spark 3.5.0 and Scala 2.12. I guess it is because of this issue. Is there any update on this? It looks like a packaging error. At the moment the package is unusable at its latest version, which is the only one built for Spark 3.5.x.

Since it is a packaging error, I believe it is a Mill issue. There were some changes in build.sc since 0.19, as well as an update from Mill 0.11.4 to 0.11.5. Unfortunately I am no Mill expert, nor have I gotten the build working in IntelliJ (at least my first tries were pretty unsuccessful). I'll keep trying, but if some Mill expert could help, that would be great.
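
If that hypothesis is right, the relevant knob would be how the assembly treats META-INF/services entries. A hypothetical build.sc fragment to illustrate the mechanism (not the project's actual configuration; module name, Scala version, and the Mill 0.11.x Assembly API are all assumptions):

import mill._, scalalib._

object excel extends ScalaModule {
  def scalaVersion = "2.13.12"

  // Concatenate ServiceLoader registration files from all inputs instead of
  // keeping only the first one, so DataSourceRegister entries survive assembly.
  def assemblyRules = Assembly.defaultRules ++ Seq(
    Assembly.Rule.AppendPattern("META-INF/services/.*")
  )
}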

In the meantime you could try the Spark 3.4.1 / spark-excel 0.19.0 build (3.4.1_0.19.0) with Spark 3.5 and spark.read.format("excel"). It could work because there were no DataSourceV2 API changes from 3.4 to 3.5...
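
For example, with that artifact on the classpath (a sketch of the suggested workaround; the file path is a placeholder):

// started with: spark-shell --packages com.crealytics:spark-excel_2.12:3.4.1_0.19.0
val df = spark.read
  .format("excel")
  .option("header", "true")
  .load("/path/to/file.xlsx")

df.show()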

@christianknoepfle thanks for the advice and the efforts. I temporarily downgraded my cluster to 3.4.1.

The commit introducing the issue seems to be e911d0cf8bd5465f7a3f82289c50045556ba6c91, which is a little surprising because it contains only the minimal changes needed to update Mill.

The incorrect JAR packaging issue should be solved in 0.20.2.
Please test and comment here if it isn't.