jsonkenl / xlsxir

Xlsx parser for the Elixir language.

Improve parsing worksheets

AlexKovalevych opened this issue:

I think we can improve worksheet parsing by not loading each sheet entirely into memory, but instead parsing it line by line on demand (using a Stream).

Here is the idea:

  1. Parse sharedString.xml the same way we do right now.
  2. Create custom Stream for each worksheet.
  3. When the next row is requested, read the next line of worksheet#{i}.xml, feed it to the SAX parser, and return the parsed row.

Of course, if the entire worksheet is a single line (which I think rarely happens), we can't do that.
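The "custom Stream" in steps 2 and 3 can be sketched with `Stream.resource/3`. This is a simplified illustration, not the actual PR implementation: it assumes each `<row>` element of `worksheet#{i}.xml` sits on its own line (the caveat above), and `LazyRows` is a hypothetical module name.

```elixir
# Sketch only: a custom Stream that reads a worksheet XML file line by
# line, emitting each <row> line on demand instead of loading the whole
# file into memory. `LazyRows` is a hypothetical name for illustration.
defmodule LazyRows do
  def stream(path) do
    Stream.resource(
      # start: open the worksheet file
      fn -> File.open!(path, [:read]) end,
      # next: read one line; emit it only if it contains a <row> element
      fn device ->
        case IO.read(device, :line) do
          :eof ->
            {:halt, device}

          line ->
            if String.contains?(line, "<row"),
              do: {[String.trim(line)], device},
              else: {[], device}
        end
      end,
      # done: close the file
      fn device -> File.close(device) end
    )
  end
end
```

With this shape, `LazyRows.stream(path) |> Enum.take(1)` reads only as far into the file as the first row, which is what would keep memory flat during per-row parsing.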

While this memory optimization would only apply to the Xlsxir.get_list/1 use case, I'd find it useful. Currently, parsing a 90K-row sheet grows RAM usage by 600 to 700 MB.
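For reference, RAM growth like the figure above can be observed with the Erlang runtime's own accounting via `:erlang.memory/1`. This is a generic sketch, not the exact methodology behind the numbers in this thread; the row-building loop just stands in for parsing a sheet entirely into memory.

```elixir
# Sketch: measure memory growth of building a large in-memory row list,
# using :erlang.memory/1 (returns bytes). Not the exact measurement
# used for the numbers quoted above.
mem_before = :erlang.memory(:total)

# Stand-in for parsing a ~90K-row sheet into one big list.
rows = Enum.map(1..90_000, fn i -> [i, "cell_a#{i}", "cell_b#{i}"] end)

mem_after = :erlang.memory(:total)
growth_mb = (mem_after - mem_before) / 1_048_576
IO.puts("rows: #{length(rows)}, growth: #{Float.round(growth_mb, 1)} MB")
```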

@pma would you mind helping me test Alex's PR (#52)? I would appreciate your thoughts on it.

@jsonkennell Will do. I'll compare processing time and RAM usage.

@jsonkennell @AlexKovalevych

My test results are below. The total time to load and iterate the rows was cut in half.

One comment/suggestion: now that it's possible to stream output directly from the SAX parser, it would be great to have streams all the way up to the public API. An Xlsxir.stream function would make it easier to compose with things like GenStage/Flow. It would also cut RAM usage, since we could avoid holding all the rows at once as an intermediate list.
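To illustrate the shape such a pipeline could take: `Xlsxir.stream` did not exist at the time of writing, so a generated Stream of fake rows stands in for it below; only the lazy composition is the point.

```elixir
# Hypothetical pipeline shape: if the public API exposed a row stream,
# consumers could filter and transform lazily without ever holding the
# full sheet as an intermediate list. `fake_rows` stands in for the
# nonexistent Xlsxir.stream/2.
fake_rows = Stream.map(1..90_984, fn i -> [i, "value_#{i}"] end)

result =
  fake_rows
  |> Stream.filter(fn [i, _] -> rem(i, 2) == 0 end)
  |> Stream.map(fn [_, v] -> v end)
  |> Stream.take(3)
  |> Enum.to_list()

# result == ["value_2", "value_4", "value_6"]
```

For GenStage/Flow composition, the same stream could feed `Flow.from_enumerable/1` for parallel downstream processing.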

Test: read a 90,984-row sheet (3.1 MB)

iex(1)> {t, {:ok, table_id}} = :timer.tc(fn -> Xlsxir.multi_extract("Test1.xlsx", 0) end)
iex(2)> :timer.tc(fn -> Xlsxir.get_list(table_id) |> Enum.count end)
| branch | fun | time (s) |
| --- | --- | --- |
| master | Xlsxir.multi_extract/2 | 33 |
| master | Xlsxir.get_list/1 | 2 |
| performance-boost | Xlsxir.multi_extract/2 | 9 |
| performance-boost | Xlsxir.get_list/1 | 8 |

Memory usage increase after running the test (in MB):

| branch | proc mem | bin mem | ets mem |
| --- | --- | --- | --- |
| master | 150 | 40 | 83 |
| performance-boost | 39 | 42 | 83 |

Merged #52, so closing for now.