
[BUG] ClassNotFoundException for 'excel.DefaultSource' while using API V2 #789

Closed
1 task done
RupeshKharche opened this issue Oct 2, 2023 · 13 comments

@RupeshKharche
RupeshKharche commented Oct 2, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I am using Spark version 3.5.0 with Scala version 2.13.
I am getting java.lang.ClassNotFoundException: excel.DefaultSource for the following line of code:
Dataset<Row> df = spark.read().format("excel").option("header", "true").load(path);

I have also tried the following code but got a similar error (ClassNotFoundException: com.crealytics.spark.excel.DefaultSource):
Dataset<Row> df = spark.read().format("com.crealytics.spark.excel").option("header", true).load(path);

I have inspected the jar file spark-excel_2.13-3.5.0_0.20.1.jar, but it is missing the package com.crealytics.spark.excel.
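Worth noting: Spark resolves both format("excel") and format("com.crealytics.spark.excel") via Java's ServiceLoader, which looks for a registration file inside the jar at META-INF/services/org.apache.spark.sql.sources.DataSourceRegister. A quick way to check whether a given spark-excel jar carries that file is a small Python sketch (a jar is just a zip archive; the in-memory jar below is a synthetic stand-in — point the function at the real jar's bytes):

```python
import io
import zipfile

# Spark discovers data source short names via ServiceLoader, which reads this file.
SERVICE_FILE = "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"

def has_datasource_register(jar_bytes: bytes) -> bool:
    """Return True if the jar (given as raw bytes) contains the ServiceLoader
    registration file that Spark needs to resolve format names."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return SERVICE_FILE in jar.namelist()

# Demo with a minimal in-memory "jar":
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr(SERVICE_FILE, "com.crealytics.spark.excel.DefaultSource\n")
print(has_datasource_register(buf.getvalue()))  # → True
```

For a jar on disk, `has_datasource_register(open("spark-excel_2.13-3.5.0_0.20.1.jar", "rb").read())` would tell you whether the registration file made it into the package.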

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Spark version: 3.5.0
- Spark-Excel version: 0.20.1
- OS: Windows 10
- Cluster environment: no cluster
- dev env: Java 17 + Maven

Anything else?

No response

@dolfinus

dolfinus commented Oct 3, 2023

Compare 0.19.0 with 0.20.1:
[image: JAR contents of 0.19.0 vs. 0.20.1]

@walkcoolboy

walkcoolboy commented Oct 3, 2023

Having the same problem after installing com.crealytics:spark-excel_2.12:3.4.1_0.20.1 from Maven in an Azure Databricks cluster with runtime version 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12).

Confirmed working after switching to com.crealytics:spark-excel_2.12:3.4.1_0.19.0.

@dolfinus

dolfinus commented Oct 9, 2023

@nightscape Could you take a look, please?

@phoeph

phoeph commented Oct 12, 2023

spark-excel_2.12:3.4.1_0.19.0 YES.
spark-excel_2.12:3.4.1_0.20.1 NO.
spark-excel_2.13-3.5.0_0.20.1 NO.

USING:

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  // .option("useHeader", "true")
  .load("/Users/Leo/unicom/wow_emotes.xlsx")

df.show
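
The working/failing coordinates above all follow the same pattern: the artifact is suffixed with the Scala version, and the version string encodes both the Spark version and the spark-excel version. Since the thread shows how easy these are to mistype, here is a small (hypothetical) helper that assembles the coordinate from its three parts:

```python
def spark_excel_coordinate(scala: str, spark: str, plugin: str) -> str:
    """Build the Maven coordinate for spark-excel.
    The artifact version encodes both the Spark version and the plugin version:
    com.crealytics:spark-excel_<scala>:<spark>_<plugin>."""
    return f"com.crealytics:spark-excel_{scala}:{spark}_{plugin}"

print(spark_excel_coordinate("2.12", "3.4.1", "0.19.0"))
# → com.crealytics:spark-excel_2.12:3.4.1_0.19.0
```

The returned string is what you would pass to spark.jars.packages or install from Maven in Databricks.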

@nightscape
Owner

Could somebody look into this? I'll only get around to having a look at it in ~1 month because we're in the last stages of house construction and then moving...

@christianknoepfle
Contributor

Same here. I was finally trying to update our Spark from 3.3 to 3.4 and stumbled over the same issue. It seems to be related to the change from Spark 3.3 to 3.4, and for me it is not related to the actual spark-excel package version (0.19 and up are all failing, even if they work for others). Will look into it...

@christianknoepfle
Contributor

I was wrong with my previous statement. The bug was introduced between 0.19 and 0.20(.1), and the issue is that the DataSourceRegister service file (under META-INF/services) is not packaged into the jar.
[image]

@MarkusFra

MarkusFra commented Oct 27, 2023

I get

Py4JJavaError: An error occurred while calling o588.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find data source: com.crealytics.spark.excel. Please find packages at `https://spark.apache.org/third-party-projects.html`.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:870)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:747)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:797)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:337)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:244)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: com.crealytics.spark.excel.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:733)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:733)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:733)
	... 15 more

when trying to do

spark.read.format("com.crealytics.spark.excel")

in PySpark with Spark 3.5.0, Scala 2.12. I guess it is because of this issue. Is there any update on this? It seems like a packaging error. At the moment the package is unusable at its latest version, which is the only Spark 3.5.* build.
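
The stack trace above also explains why the error names com.crealytics.spark.excel.DefaultSource even though that class was never requested: Spark's lookupDataSource first checks short names registered via the DataSourceRegister service file, and only then falls back to loading the format string as a class name, with and without a ".DefaultSource" suffix. A rough, simplified Python sketch of that fallback chain (the `registered` dict and `loadable` set are hypothetical stand-ins for the ServiceLoader registry and the classpath):

```python
def lookup_data_source(fmt: str, registered: dict, loadable: set) -> str:
    """Simplified sketch of Spark's DataSource.lookupDataSource fallback chain:
    1. a short name registered via META-INF/services (DataSourceRegister),
    2. the format string itself as a class name,
    3. the format string with '.DefaultSource' appended."""
    if fmt in registered:
        return registered[fmt]
    for candidate in (fmt, fmt + ".DefaultSource"):
        if candidate in loadable:
            return candidate
    raise LookupError(f"[DATA_SOURCE_NOT_FOUND] Failed to find data source: {fmt}")

# With the service file missing from the jar, step 1 finds nothing and Spark
# falls through to step 3, hence the ...excel.DefaultSource in the Caused by:
try:
    lookup_data_source("com.crealytics.spark.excel", registered={}, loadable=set())
except LookupError as e:
    print(e)
```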

@christianknoepfle
Contributor

Since it is a packaging error, I believe it is a Mill issue. There were some changes in build.sc since 0.19, as well as an update from Mill 0.11.4 to 0.11.5. Unfortunately I am no Mill expert, nor have I gotten the build working in IntelliJ (at least my first tries were pretty unsuccessful). I'll keep trying, but if some Mill expert could help, that would be great.

@christianknoepfle
Contributor

In the meantime you could try the Spark 3.4.1 / spark-excel 0.19.0 build (3.4.1_0.19.0) with Spark 3.5 and spark.read.format("excel"). It could work because there were no DataSourceV2 API changes from 3.4 to 3.5...

@MarkusFra

@christianknoepfle thanks for the advice and the efforts. I temporarily downgraded my cluster to 3.4.1.

@nightscape
Owner

The commit introducing the issue seems to be e911d0cf8bd5465f7a3f82289c50045556ba6c91, which is a little bit surprising because it contains only the minimal changes to update Mill.

@nightscape
Owner

The incorrect JAR files issue should be solved in 0.20.2.
Please test and comment here if it isn't.
