[BUG] NoSuchElementException: key not found: _corrupt_record

Question

hyh1618 opened this issue a year ago

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

I'm using PySpark to load an Excel file, and I added spark-excel_2.12-3.3.1_0.18.5.jar to ~spark/jars.
Before reading the Excel file, I added a "_corrupt_record" column to dfschema:

        df = spark.read \
            .format("com.crealytics.spark.excel") \
            .option("header", False) \
            .option("dataAddress", dataAddress) \
            .option("mode", "PERMISSIVE") \
            .option("enforceSchema", True) \
            .option("columnNameOfCorruptRecord", "_corrupt_record") \
            .schema(dfschema) \
            .load("c:/tmp/SPARK_EXCEL_SAMPLE.xlsx")

After this line, I get the following error:
File "[spark]\python\lib\py4j-0.10.9.5-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o56.cache.
: java.util.NoSuchElementException: key not found: _corrupt_record

If I remove this column from the schema, the DataFrame is created without errors, but all bad rows are lost, and my goal is to capture the bad records. Please help.
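
For reference, a schema like the dfschema mentioned above could be built as a DDL-formatted string, which `DataFrameReader.schema()` accepts in place of a `StructType`. This is only a sketch: the data columns `id` and `name` are hypothetical placeholders, not the layout of the actual sheet, and declaring the corrupt-record column as a string follows the convention of Spark's built-in CSV and JSON readers. Whether spark-excel honors it is exactly what this issue is about.

```python
# Hypothetical data columns; replace with the real layout of the sheet.
data_columns = [("id", "INT"), ("name", "STRING")]

# Name must match the "columnNameOfCorruptRecord" option passed to the reader.
corrupt_column = "_corrupt_record"

# Append the corrupt-record column as a string field at the end of the schema.
fields = data_columns + [(corrupt_column, "STRING")]

# Build a DDL-formatted schema string, e.g. "col0 INT, col1 STRING".
dfschema = ", ".join(f"{name} {dtype}" for name, dtype in fields)

print(dfschema)
```

The resulting string can be passed unchanged to `.schema(dfschema)` in the reader call above.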

Expected Behavior

To capture bad records in the "_corrupt_record" column.

Steps To Reproduce

No response

Environment

- Spark version: 3.3.0
- Spark-Excel version: spark-excel_2.12-3.3.1_0.18.5.jar
- OS: Windows
- Cluster environment: no

Anything else?

If I add the following JARs to ~Spark/Jars:
xmlbeans-3.1.0.jar
poi-ooxml-schemas-4.1.2.jar
commons-collections4-4.4.jar (comes with Spark)

I get a different error:
File "[spark]\python\lib\py4j-0.10.9.5-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.cache.
: java.lang.NoSuchFieldError: Factory

Martin Mauch · Answer 1 · Wed Jun 07 2023 16:10:06 GMT+0800 (China Standard Time)

Can you try .format("excel") instead of .format("com.crealytics.spark.excel")?
This would use the v2 implementation.

hyh1618 · Answer 2 · Thu Jun 08 2023 01:46:18 GMT+0800 (China Standard Time)

That does not work. I get the following error:
Cell index must be >= 0

Martin Mauch · Answer 3 · Thu Jun 08 2023 05:32:33 GMT+0800 (China Standard Time)

The error message by itself isn't very useful...
Do you get a stacktrace?
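
One way to collect the full stack trace from PySpark is to wrap the failing action and format the exception with the standard `traceback` module. This is a minimal sketch: the helper name `run_and_capture` is hypothetical, and the commented usage assumes the `df` from the original post. As far as I know, the message of a `Py4JJavaError` carries the Java-side stack trace, so it should appear in the captured text, but that is an assumption worth checking against the actual output.

```python
import traceback


def run_and_capture(action):
    """Run a zero-argument callable and return (result, trace).

    On success, trace is None. On failure, result is None and trace
    holds the formatted Python traceback, including the exception message.
    """
    try:
        return action(), None
    except Exception:
        return None, traceback.format_exc()


# Usage with the DataFrame from the original post (not executed here):
# _, trace = run_and_capture(lambda: df.cache().count())
# if trace is not None:
#     print(trace)
```

Pasting the captured text into the issue gives the complete trace rather than only the final message line.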