crealytics / spark-excel

A Spark plugin for reading and writing Excel files

[BUG] When Read Excel Files, Several Errors Using Java

yumble opened this issue · comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Several problems occur when reading Excel files with Spark in Java.

I'm building a service in which users upload their own files, so the file format is arbitrary and the schema cannot be specified in advance.

[Screenshot: the Excel file being read]

[Screenshot: how the formats should be displayed]

[Screenshot: the problem as it currently appears]

The problems with the current Excel file are as follows.

  1. [ ] The date format is not applied to the first column (it should be displayed as in the second picture, but it is not recognized). I want it displayed in "yyyy-MM-dd'T'HH:mm:ss.SSSSZ" format.

  2. [ ] Although the second and third columns have the same cell format, the second column is read as a string prefixed with "₩ ", and the third column comes out in scientific notation. I want both expressed as plain numbers, e.g. ₩ 100,000 -> 100000.

     In the second column the underlying values are numbers; the cell style adds the currency unit and separators, e.g. 3000000 -> ₩ 3,000,000. In the third column the cells contain formulas, e.g. =C3+(C3*0.35).

  3. [ ] Columns without headers should also be displayed when they contain data, but such columns are currently ignored.
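For item 2, one workaround (a sketch, not part of spark-excel) is to read the column as a string and strip the display formatting in post-processing; `CurrencyCleaner.parseWon` here is a hypothetical helper, not an existing API:

```java
// Hypothetical post-processing helper: strips the currency symbol,
// spaces and thousands separators that the Excel cell style adds,
// leaving only the underlying number (e.g. "₩ 3,000,000" -> 3000000).
public class CurrencyCleaner {
    public static long parseWon(String formatted) {
        // Keep only digits and a possible minus sign.
        String digits = formatted.replaceAll("[^0-9-]", "");
        return Long.parseLong(digits);
    }

    public static void main(String[] args) {
        System.out.println(parseWon("₩ 100,000"));   // prints 100000
        System.out.println(parseWon("₩ 3,000,000")); // prints 3000000
    }
}
```

The same string-level cleanup can be done inside Spark with `regexp_replace` followed by a cast, but the logic is identical.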
        String workSheet = String.format("'%s'!A1", excel.getWorkSheet());

        Dataset<Row> df = sparkSession.read()
                .format("com.crealytics.spark.excel")
                .option("dataAddress", workSheet)
                .option("header", excel.isDefaultHeader())
                .option("maxColumns", 1000) // TODO: guardrails
                .option("columnNameOfCorruptRecord", "true")
                .option("columnNameOfRowNumber", "true")
                .option("inferSchema", "false")
                .option("enforceSchema", "false")
                .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSSZ")
                .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSSZ")
                .load(paths.left);
        df.show();
        df.printSchema(); // df.schema() alone returns the schema without printing it

I've tried changing the values of the options above, but the situation is unchanged.
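As a sanity check on the target format, independent of spark-excel: "yyyy-MM-dd'T'HH:mm:ss.SSSSZ" is a valid java.time pattern, and a timestamp rendered with it looks like this (plain-Java sketch; the example timestamp is made up):

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class DateFormatCheck {
    // The same pattern passed to the dateFormat/timestampFormat options above.
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSSZ");

    static String render(OffsetDateTime ts) {
        return ts.format(FMT);
    }

    public static void main(String[] args) {
        // 'Z' needs a zone offset, hence OffsetDateTime rather than LocalDateTime.
        OffsetDateTime ts = OffsetDateTime.of(2024, 1, 15, 9, 30, 0, 0, ZoneOffset.UTC);
        System.out.println(render(ts)); // prints 2024-01-15T09:30:00.0000+0000
    }
}
```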

Please let me know if you have a solution to any of these problems.

Expected Behavior

  1. [ ] I want the first column displayed in "yyyy-MM-dd'T'HH:mm:ss.SSSSZ" format.

  2. [ ] I want plain numbers, without cell formats (styles) and without scientific notation.

  3. [ ] Columns without headers should also be displayed when they contain data, but such columns are currently ignored.
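For item 3, a possible workaround is to read with header=false (which, assuming the usual Spark behavior, names every column positionally) and then substitute fallback names only where the header cell is blank. `HeaderFallback.fallbackNames` is a hypothetical helper, sketched here in plain Java:

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderFallback {
    // Hypothetical helper: given the first row read with header=false,
    // use the cell text as the column name when present, otherwise
    // fall back to Spark-style positional names (_c0, _c1, ...).
    static List<String> fallbackNames(List<String> headerRow) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < headerRow.size(); i++) {
            String cell = headerRow.get(i);
            names.add(cell == null || cell.isBlank() ? "_c" + i : cell.trim());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> row = new ArrayList<>();
        row.add("date");
        row.add("amount");
        row.add(null); // headerless column that still contains data
        System.out.println(fallbackNames(row)); // prints [date, amount, _c2]
    }
}
```

The resulting names could then be applied to the DataFrame with `df.toDF(names.toArray(new String[0]))`.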

Steps To Reproduce

error.xlsx

Environment

- Spark version:

implementation group: 'org.apache.spark', name: 'spark-yarn_2.12', version: '3.5.0'
implementation group: 'org.apache.spark', name: 'spark-core_2.12', version: '3.5.0'
implementation group: 'org.apache.spark', name: 'spark-sql_2.12', version: '3.5.0'

- Spark-Excel version:

implementation group: 'com.crealytics', name: 'spark-excel_2.12', version: '0.14.0'

- OS: Spring boot 2.7.6

- Cluster environment

Anything else?

No response

Please check these potential duplicates:

  • [#710] [BUG] error on reading excel files from abfs with DatasourceV2 API (62.77%)

  • [#690] [BUG] Data is not being read using streaming approach. (60.28%)
    If this issue is a duplicate, please add any additional info to the ticket with the most information and close this one.

Please always use the newest version when reporting bugs. Some things might already have been fixed in the meantime.