crealytics / spark-excel

A Spark plugin for reading and writing Excel files

[BUG] NoSuchElementException: key not found: _corrupt_record

hyh1618 opened this issue

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I'm using PySpark to load an Excel file, and I added spark-excel_2.12-3.3.1_0.18.5.jar to ~spark/jars.
Before reading the Excel file, I added a "_corrupt_record" column to dfschema:

        df = spark.read \
            .format("com.crealytics.spark.excel") \
            .option("header", False) \
            .option("dataAddress", dataAddress) \
            .option("mode", "PERMISSIVE") \
            .option("enforceSchema", True) \
            .option("columnNameOfCorruptRecord", "_corrupt_record") \
            .schema(dfschema) \
            .load("c:/tmp/SPARK_EXCEL_SAMPLE.xlsx")

After this line, I got the following error:
File "[spark]\python\lib\py4j-0.10.9.5-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o56.cache.
: java.util.NoSuchElementException: key not found: _corrupt_record

If I remove this column from the schema, the DataFrame is created fine, but all bad rows are lost, and my goal is to capture the bad records. Please help.

Expected Behavior

To catch bad records in the "_corrupt_record" column.

Steps To Reproduce

No response

Environment

- Spark version: 3.3.0
- Spark-Excel version: spark-excel_2.12-3.3.1_0.18.5.jar
- OS: Windows
- Cluster environment: no

Anything else?

If I add the following to ~Spark/Jars:
xmlbeans-3.1.0.jar
poi-ooxml-schemas-4.1.2.jar
commons-collections4-4.4.jar (comes with Spark)

I get a different error:
File "[spark]\python\lib\py4j-0.10.9.5-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.cache.
: java.lang.NoSuchFieldError: Factory
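
A java.lang.NoSuchFieldError generally means a class was compiled against a different version of a library than the one found on the classpath at runtime. As a rough diagnostic sketch (not part of spark-excel), the snippet below lists the versions of a few jars found in a directory so that duplicates or mismatches stand out. The function name `find_jar_versions`, the watched prefixes, and the assumption that jars are named `<artifact>-<version>.jar` are all illustrative.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical watch list: artifact-name prefixes relevant to this issue.
WATCHED_PREFIXES = ("poi", "xmlbeans", "spark-excel", "commons-collections4")

# Assumes the usual "<artifact>-<version>.jar" naming, where the version
# starts with a digit (e.g. "xmlbeans-3.1.0.jar").
JAR_PATTERN = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_jar_versions(jars_dir, prefixes=WATCHED_PREFIXES):
    """Map each watched artifact name to the sorted list of versions found.

    `prefixes` must be a tuple of strings (str.startswith accepts a tuple).
    """
    versions = defaultdict(set)
    for jar in Path(jars_dir).glob("*.jar"):
        match = JAR_PATTERN.match(jar.name)
        if match and match.group("name").startswith(prefixes):
            versions[match.group("name")].add(match.group("version"))
    return {name: sorted(found) for name, found in versions.items()}
```

Calling `find_jar_versions` on the Spark jars directory and looking for artifacts with more than one version, or for POI-related jars whose versions do not line up, is one way to narrow down this kind of error.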

Can you try .format("excel") instead of .format("com.crealytics.spark.excel")?
This would use the v2 implementation.

That does not work. I get the following error:
Cell index must be >= 0

The error message by itself isn't very useful...
Do you get a stacktrace?
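
One possible way to collect a full stack trace to paste into the issue is to wrap the failing call and format the exception with the standard `traceback` module. This is only a sketch: `capture_stacktrace` is a hypothetical helper, and it assumes that the text of a PySpark/py4j error usually already contains the JVM-side trace.

```python
import traceback


def capture_stacktrace(action):
    """Run action(); return None on success, or the full traceback text on failure."""
    try:
        action()
    except Exception as exc:
        # format_exception returns a list of lines, including the exception message.
        return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return None
```

For example, passing a zero-argument function that performs the `spark.read ... .load(...)` call followed by `.cache()` would return the complete error text instead of only its last line.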