crealytics / spark-excel

A Spark plugin for reading and writing Excel files

Building an assembly with 0.18 fails with deduplicate errors

christianknoepfle opened this issue

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We have a "spark sdk" library, which utilizes the crealytics spark-excel library. That library results in a jar that we upload to our artifactory. In another repo we reference that spark sdk to build some spark apps. Since those apps are executed on EMR we have to build assembly jars (using sbt-assembly).

With spark-excel 0.17.2 everything worked fine.

With spark-excel 0.18.0 I got gazillions of "deduplicate errors":

[error] deduplicate: different file contents found in the following:
[error] /root/.cache/coursier/v1/https/repo1.maven.org/maven2/com/crealytics/spark-excel_2.12/3.2.2_0.18.0/spark-excel_2.12-3.2.2_0.18.0.jar:spoiwo/natures/xlsx/BaseXlsx.class
[error] /root/.cache/coursier/v1/https/repo1.maven.org/maven2/com/norbitltd/spoiwo_2.12/2.2.1/spoiwo_2.12-2.2.1.jar:spoiwo/natures/xlsx/BaseXlsx.class
[error]  deduplicate: different file contents found in the following:
[error] /root/.cache/coursier/v1/https/repo1.maven.org/maven2/com/crealytics/spark-excel_2.12/3.2.2_0.18.0/spark-excel_2.12-3.2.2_0.18.0.jar:com/microsoft/schemas/compatibility/AlternateContentDocument$AlternateContent$Choice.class
[error] /root/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/poi/poi-ooxml-lite/5.2.2/poi-ooxml-lite-5.2.2.jar:com/microsoft/schemas/compatibility/AlternateContentDocument$AlternateContent$Choice.class

I am by no means an expert in merge strategy stuff, so I am pretty puzzled and have no idea where this is coming from. The dependencyTree shows me that spoiwo (and others) are referenced only by spark-excel. So I started adding merge strategies; since the issue affected all jars coming from spark-excel, I ended up with MergeStrategy.first as a catch-all to make it assemble. Still, I have no idea if this is the right approach (I do not really like that catch-all).
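For the record, the catch-all workaround looks roughly like this in build.sbt (a sketch, assuming sbt-assembly 1.x; the META-INF rule is a common companion to avoid broken signature files, not something from the report):

ThisBuild / assemblyMergeStrategy := {
  // drop META-INF metadata (signature files etc.) instead of merging it
  case PathList("META-INF", _*) => MergeStrategy.discard
  // catch-all: silently keep the first copy of every conflicting file
  case _ => MergeStrategy.first
}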

Expected Behavior

It would be nice if I did not have to add special merge strategies when using spark-excel (or at least got a hint about what I need to change).

Steps To Reproduce

Since this happens in a complex environment I have no easy way to build something that reproduces the issue.

In case you have some ideas why this happens, please let me know. If this doesn't lead to a solution, I can try to build a simple scenario that reproduces the issue.

Environment

- Spark version: 3.2.1
- Spark-Excel version: 0.18.0
- OS: Windows/Ubuntu
- Cluster environment: N/A

Anything else?

Thanks for your help :)

Hi @christianknoepfle, the shading stuff is ugly as hell...
I guess the reason for your issue is that spark-excel contains these classes bundled, but also still lists them in the pom.xml that is uploaded to Maven Central.
I have not yet found a way to get rid of this issue.
You could try modifying your locally cached version of the pom.xml (it's somewhere in ~/.ivy2, ~/.coursier or ~/.m2, depending on your build tool) to remove the dependencies and see if that solves the issue.
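In sbt terms, "removing the dependencies" could also be expressed as exclusion rules on the spark-excel module instead of editing cached files (a sketch; note the reporter states further down that exclude rules did not solve it for them):

libraryDependencies += ("com.crealytics" %% "spark-excel" % "3.2.2_0.18.0")
  .exclude("com.norbitltd", "spoiwo_2.12")     // classes also bundled in spark-excel
  .exclude("org.apache.poi", "poi-ooxml-lite") // likewise bundled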

  • since v0.18.0, the POI jar contents are bundled directly into spark-excel, and they are no longer shaded; the shading was causing lots of problems


Heck, some years ago it was DLL hell, now it's shading hell :( The point is that we run the build on our GitLab runner, so fiddling around in the cache won't do it... Nevertheless, thanks for the idea; I will play around with it and see if this leads to something...


Out of curiosity: Why are those jars bundled? If the POM references them with explicit versions we should be fine. Or not?

Spark itself has a lot of common unshaded dependencies pinned to specific versions. If spark-excel uses one of these dependencies (possibly transitively) in a different version, then one can get the dreaded NoSuchMethodErrors or ClassNotFoundExceptions at runtime.
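A hedged illustration of that failure mode: commons-compress is one dependency shared by Spark and the POI stack (the exact artifact and version here are illustrative), and pinning it to the version Spark ships can avoid those runtime errors:

// build.sbt: force the version Spark bundles, overriding whatever
// spark-excel pulls in transitively (artifact and version illustrative)
dependencyOverrides += "org.apache.commons" % "commons-compress" % "1.21"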

Since I opened the issue I have tried various ways to cope with this duplicate problem: exclude rules and assorted other magic, but to no avail. It seems the only way to make it work was a catch-all MergeStrategy.first (I haven't found a way to limit it to spark-excel). I didn't want to implement that because it might hide other problems, and finding those later gets very painful.

So finally I cloned the repo to build our own version of spark-excel. I moved all the shaded deps to regular library dependencies to get a jar containing only the spark-excel 'code' itself. And that just worked (for Spark 3.2, along with our other custom code).
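The change amounts to something like this in the forked build, with the bundled classes replaced by ordinary library dependencies (a sketch; the versions are the ones from the error log above):

// build.sbt of the fork: do not bundle, resolve a single copy instead
libraryDependencies ++= Seq(
  "org.apache.poi" %  "poi-ooxml-lite" % "5.2.2",
  "com.norbitltd"  %% "spoiwo"         % "2.2.1"
)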

I do not like that approach either because we have to maintain a copy of spark-excel and keep up with code changes.

So I am wondering: could plain spark-excel jars also be provided (in addition to the existing ones, if those are really needed)?

But for now I am happy because I can finally use V2 :)

@christianknoepfle one could configure the build so that both a fat and a thin jar are published.
One of the artifacts could then have a classifier (in Maven speak) to distinguish them.
The issue is that spark-submit and spark-shell don't support specifying a classifier, so with those tools one could only use the variant without a classifier.
In sbt it should work.
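For reference, the sbt-assembly documentation shows a pattern for publishing the fat jar under an "assembly" classifier next to the regular jar (a sketch):

lazy val root = (project in file("."))
  .settings(
    // give the assembly artifact a classifier so it is published
    // alongside, not instead of, the regular (thin) jar
    assembly / artifact := {
      val art = (assembly / artifact).value
      art.withClassifier(Some("assembly"))
    },
    addArtifact(assembly / artifact, assembly)
  )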


The issue is that spark-submit and spark-shell don't support specifying a classifier, so with those tools one could only use the variant without a classifier.
In sbt it should work.

Normally you should not use an assembly jar (fat jar) with the spark-submit --packages parameter, as --packages will also resolve and add the jars of the transitive dependencies. But you can add individual jars with the --jars parameter; that's where the assembly jar should go.
This means that the thin jar should be published without a classifier, so it can be used with --packages. The assembly jar can then be published with a classifier and used with the --jars parameter (or uploaded to Azure Synapse, which doesn't support resolving Maven artifacts with their transitive dependencies, right? Side note: Databricks does...). See the proposal in #696.