crealytics / spark-excel

A Spark plugin for reading and writing Excel files

Unable to read 250MB file even with 100G driver memory and 100G executor memory

kondisettyravi opened this issue · comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When we try to use the jar to read Excel data from S3, spark-shell exits with an OOM error. Unfortunately, I cannot share the file here.

Below is the code being used

val df = spark.read.format("excel")
  .option("header", "true")
  .option("dataAddress", s"'${sheetName}'!A1:XFD1000000")
  .option("maxByteArraySize", "2147483647")
  .load(s"s3://<bucketname>/path/file.xlsx")

As soon as I run the command, spark-shell exits with the OOM error shown below.

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 32598"...
/usr/lib/spark/bin/spark-shell: line 47: 32598 Killed                  "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
[hadoop@ip-10-0-7-220 ~]$ 

Please suggest. Thanks.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Spark version: 3.1.2
- Spark-Excel version: 0.17.1
- OS:
- Cluster environment: EMR

Anything else?

No response

Why not try

.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)

I tried with this option and got

shadeio.poi.util.RecordFormatException: Tried to read data but the maximum length for this record type is 100,000,000.
If the file is not corrupt or large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
  at shadeio.poi.util.IOUtils.throwRecordTruncationException(IOUtils.java:610)
  at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:249)
  at shadeio.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:220)
  at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:81)
  at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
  at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
  at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
  at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
  at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
  at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
  at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
  at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:110)
  at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:126)
  at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
  at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
  at scala.Option.orElse(Option.scala:447)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
  at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:93)
  at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:91)
  at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:22)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:274)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 47 elided

Ok, so it fails during schema inference. Are you able to specify a schema manually?

Oh, we have many different files, so specifying a schema isn't possible right now. We also tried without inferring the schema and it failed with a StackOverflowError.

Did you try the combination of specifying a schema and using maxRowsInMemory?
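
For reference, a minimal sketch of that combination might look like the following. The schema fields here are made up for illustration; the real column names and types depend on the file, and the bucket, path, and sheet name placeholders are carried over from the original snippet.

import org.apache.spark.sql.types._

// Hypothetical schema; replace these fields with the actual columns of the sheet
val customSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true),
  StructField("date", StringType, nullable = true)
))

val df = spark.read.format("excel")
  .option("header", "true")
  .option("dataAddress", s"'${sheetName}'!A1:XFD1000000")
  .option("maxRowsInMemory", 20) // streaming reader for large files
  .schema(customSchema)          // explicit schema, so the inference step that failed above is skipped
  .load(s"s3://<bucketname>/path/file.xlsx")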