crealytics / spark-excel

A Spark plugin for reading and writing Excel files

Unable to read 250MB file even with 100G driver memory and 100G executor memory

kondisettyravi opened this issue · comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When we try to use the jar to read Excel data from S3, spark-shell exits with an OOM error. Unfortunately, I cannot share the file here.

Below is the code being used

val df = spark.read.format("excel")
  .option("header", "true")
  .option("dataAddress", s"'${sheetName}'!A1:XFD1000000")
  .option("maxByteArraySize", "2147483647")
  .load(s"s3://<bucketname>/path/file.xlsx")

As soon as I run the command, spark-shell exits with the OOM error shown below.

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 32598"...
/usr/lib/spark/bin/spark-shell: line 47: 32598 Killed                  "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
[hadoop@ip-10-0-7-220 ~]$ 

Please suggest. Thanks.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Spark version: 3.1.2
- Spark-Excel version: 0.17.1
- OS:
- Cluster environment: EMR

Anything else?

No response

Why not try

.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)

I tried with this option and got

shadeio.poi.util.RecordFormatException: Tried to read data but the maximum length for this record type is 100,000,000.
If the file is not corrupt or large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
  at shadeio.poi.util.IOUtils.throwRecordTruncationException(IOUtils.java:610)
  at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:249)
  at shadeio.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:220)
  at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:81)
  at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
  at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
  at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
  at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
  at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
  at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
  at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
  at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:110)
  at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:126)
  at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
  at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
  at scala.Option.orElse(Option.scala:447)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
  at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:93)
  at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:91)
  at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:22)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:274)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 47 elided

Ok, so it fails during schema inference. Are you able to specify a schema manually?

Oh, we have many different files, so specifying a schema isn't possible right now. We also tried without inferring the schema and it failed with a StackOverflowError.

Did you try the combination of specifying a schema and using maxRowsInMemory?
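
For reference, a minimal sketch of that combination might look like the following. The schema fields here are made up for illustration; the real column names and types depend on the file, and the bucket, path, and sheet name placeholders are carried over from the original snippet.

import org.apache.spark.sql.types._

// Hypothetical schema; replace these fields with the actual columns of the sheet
val customSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true),
  StructField("date", StringType, nullable = true)
))

val df = spark.read.format("excel")
  .option("header", "true")
  .option("dataAddress", s"'${sheetName}'!A1:XFD1000000")
  .option("maxRowsInMemory", 20) // streaming reader for large files
  .schema(customSchema)          // explicit schema, so the inference step that failed above is skipped
  .load(s"s3://<bucketname>/path/file.xlsx")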