SheetJS / sheetjs

📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs

Home Page:https://sheetjs.com/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Large XLSX-Files Wrap-up

flaushi opened this issue · comments

There are numerous issues on the problem of loading large XLSX's, with partly contradicting details.
My view of the problem:

  • Some large XLSX files cannot be read.
  • It's not the number of rows for me adding a column lead to the error
  • In my case it's the call to XLSX.readFile(filename, options); that just does not return then
  • Memory limitation is not the reason for the problem: increasing the max_heap_size with --max-old-space-size=10096 does not help

Open questions:

  • Could it help to switch versions? I am using node 17.4 (and electron 13), and sheetjs 0.17.0
  • How can I evaluate if the pro packages offered by @SheetJSDev can handle my files? I cannot upload these file to some website, for privacy reasons.
  • The xlsx cli tool is able to open my problematic sheet, what is the difference between the tool shipped and my program using the same lib???

Most of the open issues involve the hardcoded V8 string length limit (a limit that is unaffected by max-old-space-size and other tuning parameters) and will also hit limits in NodeJS and the CLI tool.

The CLI tool sets the option dense: true in the XLSX.readFile call:

var dense_wb = XLSX.readFile("test.xlsx", { dense: true });

It is an (currently) undocumented option and will change the worksheet type. As long as you stick to the API functions and avoid directly addressing cells like ws["A3"], then you should be able to set the same option. The actual structure of dense workbooks will change at some point, so it's best to stick to functions like sheet_add_aoa to write data into the worksheet and sheet_to_json to extract data from the worksheets.

.

More details below:

The current worksheet structure is a large object where each key is an address. Various V8 changes over the years degraded performance of this access pattern (performance has actually regressed). We've reported bugs to NodeJS and to Chromium, including https://bugs.chromium.org/p/v8/issues/detail?id=6696

dense changes the structure to an array of arrays of cell objects.

.

Array of Arrays should have been the obvious choice. There were two early decisions that led to the current situation: "sparse worksheets" and "web workers".

"sparse worksheets": Imagine a worksheet that only sets A1 and XFD1048576. In the sparse structure, this creates an object with a "A1" key and a "XFD1048576" key. In the dense structure, to ensure the bottom cell could be stored, the worksheet is an array of length 1048576 and the last element is an array of length 16384. This just crashed IE in our testing :(

"web workers": Back in 2012 (It's been nearly 10 years since the first commit!) Web Workers and File API were relatively new. We had explored the idea of passing the binary data (be it a binary string or a Uint8Array) to the worker and passing back the worksheet object. We encountered issues trying to pass back the arrays in IE10 and ultimately concluded that it was safer to send simple objects in Worker messages.

.

IE is long gone, so why not flip the default? There are a few reasons:

"relevance": Photoshop and InDesign CC 2022 suffer the same problem.

"fallbacks": We looked into a migration using Proxy (underlying dense worksheet with a proxy that lets you access cells by address) but proxies are still incredibly inefficient in 2022.

"change": Changing defaults is hard, especially given the size of the developer base. We've recently revisited all of the demos (https://docs.sheetjs.com/docs/demos/) and quite a few of them didn't work in recent testing because some project decided to abandon CJS or changed APIs in ways that required rewriting the demo from scratch.

Thanks a lot for this verbose response. I can confirm that the dense parameter did solve my problem. I have not tested what my new limit is, but for the time being I am fine.

Actually my list of contracts is indeed dense, so this representation is ideal. The mentioned sheet with only two far distant non-empty cells is an edge case. One might think about setting dense to true by default.

Thanks!