Unable to read 250MB file even with 100G driver memory and 100G executor memory
kondisettyravi opened this issue
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
When we try to use the jar to read Excel data from S3, spark-shell exits with an OutOfMemoryError. Unfortunately, I cannot share the file here.
Below is the code being used:
val df = spark.read
  .format("excel")
  .option("header", "true")
  .option("dataAddress", s"'${sheetName}'!A1:XFD1000000")
  .option("maxByteArraySize", "2147483647")
  .load(s"s3://<bucketname>/path/file.xlsx")
As soon as I run the command, it exits spark-shell with the OOM error shown below.
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 32598"...
/usr/lib/spark/bin/spark-shell: line 47: 32598 Killed "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
[hadoop@ip-10-0-7-220 ~]$
Please suggest. Thanks.
Expected Behavior
No response
Steps To Reproduce
No response
Environment
- Spark version: 3.1.2
- Spark-Excel version: 0.17.1
- OS:
- Cluster environment: EMR
Anything else?
No response
Why not try
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
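Applied to the original call, the suggestion might look like the sketch below (the bucket, path, and sheet name are the same placeholders from the report; this is not a tested configuration):

```scala
// Sketch only: maxRowsInMemory switches spark-excel to POI's streaming
// reader, which keeps only a window of rows in memory instead of the
// whole workbook (works for xlsx, fails for old xls files).
val df = spark.read
  .format("excel")
  .option("header", "true")
  .option("dataAddress", s"'${sheetName}'!A1:XFD1000000")
  .option("maxRowsInMemory", 20)
  .load(s"s3://<bucketname>/path/file.xlsx")
```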
I tried with this option and got:
shadeio.poi.util.RecordFormatException: Tried to read data but the maximum length for this record type is 100,000,000.
If the file is not corrupt or large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
at shadeio.poi.util.IOUtils.throwRecordTruncationException(IOUtils.java:610)
at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:249)
at shadeio.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:220)
at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:81)
at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:110)
at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:126)
at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:93)
at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:91)
at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:22)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:274)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
... 47 elided
OK, so it fails during schema inference. Are you able to specify a schema manually?
Oh, we have many different files, so specifying a schema isn't possible right now. We also tried without inferring the schema, and that failed with a StackOverflowError.
Did you try the combination of specifying a schema and using maxRowsInMemory?
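That combination might look like the sketch below. The column names and types are invented for illustration (the real schema depends on the file), and maxByteArraySize is carried over from the original call to address the POI record-size limit in the stack trace above:

```scala
import org.apache.spark.sql.types._

// Hypothetical schema: column names and types are illustrative only.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)
))

val df = spark.read
  .format("excel")
  .option("header", "true")
  .option("maxRowsInMemory", 20)            // streaming reader: avoids holding the whole sheet
  .option("maxByteArraySize", "2147483647") // raises POI's byte-array cap, as in the original call
  .schema(schema)                           // explicit schema: skips inference entirely
  .load(s"s3://<bucketname>/path/file.xlsx")
```

With an explicit schema, spark-excel never has to open the workbook during planning, which is where both the OOM and the RecordFormatException above occurred.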