[BUG] NoSuchElementException: key not found: _corrupt_record
hyh1618 opened this issue · comments
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
I'm using PySpark to load an Excel file, with spark-excel_2.12-3.3.1_0.18.5.jar added to ~spark/jars.
Before reading the Excel file, I added a "_corrupt_record" column to dfschema:
```python
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", False) \
    .option("dataAddress", dataAddress) \
    .option("mode", "PERMISSIVE") \
    .option("enforceSchema", True) \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(dfschema) \
    .load("c:/tmp/SPARK_EXCEL_SAMPLE.xlsx")
```
After this line, I get the following error:

```
File "[spark]\python\lib\py4j-0.10.9.5-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o56.cache.
: java.util.NoSuchElementException: key not found: _corrupt_record
```
If I remove this column from the schema, the dataframe is created fine, but all bad rows are lost, and my goal is to capture the bad records. Please help.
Expected Behavior
To catch bad records in the "_corrupt_record" column.
Steps To Reproduce
No response
Environment
- Spark version: 3.3.0
- Spark-Excel version: spark-excel_2.12-3.3.1_0.18.5.jar
- OS: windows
- Cluster environment no
Anything else?
If I add the following jars into ~spark/jars:

- xmlbeans-3.1.0.jar
- poi-ooxml-schemas-4.1.2.jar
- commons-collections4-4.4.jar (comes with Spark)

I get a different error:

```
File "[spark]\python\lib\py4j-0.10.9.5-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.cache.
: java.lang.NoSuchFieldError: Factory
```
Can you try `.format("excel")` instead of `.format("com.crealytics.spark.excel")`? This would use the v2 implementation.
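A sketch of what the suggested v2 call might look like, wrapped in a helper so the options stay together (the helper name and parameters are illustrative, not part of spark-excel):

```python
def read_excel_v2(spark, dfschema, data_address, path):
    """Read an Excel file with the v2 "excel" data source, keeping the
    same options as the original report. All arguments are assumptions:
    an active SparkSession, a schema containing _corrupt_record, a
    dataAddress string, and the workbook path."""
    return (
        spark.read.format("excel")
        .option("header", False)
        .option("dataAddress", data_address)
        .option("mode", "PERMISSIVE")
        .option("enforceSchema", True)
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .schema(dfschema)
        .load(path)
    )
```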
That doesn't work; I get the following error:

```
Cell index must be >= 0
```
The error message by itself isn't very useful...
Do you get a stacktrace?