jsonkenl / xlsxir

Xlsx parser for the Elixir language.


Improve performance

pma opened this issue · comments

Parsing an XLSX file with about 40k rows and 7 columns takes 77.38 s.

For comparison, using Apache POI with JInterface for interop takes about 7.7 s.

Creating this as a reminder to profile the code and see if it is possible to improve the performance.

Edit: The current implementation loads the entire worksheet XML into memory and builds at least two intermediate lists to produce the formatted output. We can improve this by using xmerl_sax_parser and producing the output list in a single pass. An alternative that would add an external dependency and some C code, but could be faster, is https://github.com/processone/fast_xml/
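A minimal sketch of the single-pass idea, assuming OTP's `:xmerl_sax_parser` (module names like `OnePass` are illustrative, not part of Xlsxir): instead of building a DOM, an event function accumulates cell text as the parser streams through the XML.

```elixir
# Hedged sketch: single-pass SAX parsing with OTP's :xmerl_sax_parser.
# Text nodes arrive as {:characters, chars} events; we accumulate them
# directly, never materializing the whole document tree in memory.
defmodule OnePass do
  def parse(xml) when is_binary(xml) do
    {:ok, acc, _rest} =
      :xmerl_sax_parser.stream(xml,
        event_fun: &handle_event/3,
        event_state: []
      )

    Enum.reverse(acc)
  end

  # Collect character data; chars is a charlist in xmerl's SAX events.
  defp handle_event({:characters, chars}, _location, acc),
    do: [List.to_string(chars) | acc]

  # Ignore all other events (startElement, endElement, etc.).
  defp handle_event(_event, _location, acc), do: acc
end
```

A real implementation would also track `startElement`/`endElement` events to know which cell and row each text node belongs to, but the accumulator-per-event shape stays the same.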

Edit 2: The zip file is opened, loaded and decompressed several times (1. validate, 2. extract shared_strings, 3. extract styles, n. for each worksheet). To improve performance we should load and decompress it only once. We can start a GenServer or Agent to keep the state for a single .xlsx while it is being processed, and stop it after finishing. If all binaries are accessed only by this process, I think the garbage collector will free all the memory, including the large binaries, once the process dies.
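The decompress-once idea can be sketched with an Agent holding the extracted archive in memory; `XlsxFile` and its function names are hypothetical, not Xlsxir's API.

```elixir
# Illustrative sketch: extract the zip once into memory, keep the entries
# in a short-lived Agent, and let each consumer (shared strings, styles,
# worksheets) read from that map instead of re-opening the archive.
defmodule XlsxFile do
  def open(path) do
    # :zip.extract/2 with [:memory] returns {:ok, [{name_charlist, binary}]}
    {:ok, entries} = :zip.extract(String.to_charlist(path), [:memory])
    Agent.start_link(fn -> Map.new(entries) end)
  end

  def read(pid, entry) do
    Agent.get(pid, &Map.fetch!(&1, String.to_charlist(entry)))
  end

  # Stopping the Agent lets the VM reclaim the decompressed binaries.
  def close(pid), do: Agent.stop(pid)
end
```

Because every binary is reachable only through the Agent's state, stopping the process makes the whole decompressed archive collectible, which matches the GC behavior described above.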

Performance is definitely an issue. I've been playing around with ideas for a few weeks now. Thank you @pma for the suggestions, I'll look into using a GenServer as that sounds like a great idea.

@pma I've added SAX parsing and Erlang Term Storage (ETS) to Xlsxir. If you have time, would you mind processing your workbooks again and letting me know if performance has improved on your end? You'll need to take a look at the docs, as I've made major changes to the functionality. Thanks!

@kennellroxco With the SAX implementation the .xlsx that previously took 77s to extract now takes 9s. Xlsxir.get_list/0 runs in under 1ms. Quite an improvement, congrats!

For comparison, my implementation using Apache POI + JInterface takes 15s to do the equivalent of Xlsxir.extract/3 + Xlsxir.get_list/0.

Have you considered xmerl_sax_parser before deciding to use erlsom? If xmerl_sax_parser has comparable performance while avoiding an external dependency, it would be preferable I think.

With the state being stored in ETS, how will the library handle concurrent .xlsx parsing, since the table seems to be used as a global variable? How would this approach compare, in terms of performance and API simplicity, to having extract/3 return a reference to a unique ETS table (or even the intermediate data structure currently stored in ETS) that is then passed to get_list/1?
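The per-table suggestion could look like the following sketch; `SheetStore` is a hypothetical module, not Xlsxir's actual code. Each extract creates a fresh ETS table, so concurrent parses never share state.

```elixir
# Illustrative API shape: extract returns a unique ETS table reference
# instead of writing to one global table; get_list/1 takes that reference.
defmodule SheetStore do
  def extract(rows) do
    # :ordered_set keeps rows retrievable in index order.
    tid = :ets.new(:sheet, [:ordered_set, :public])

    rows
    |> Enum.with_index()
    |> Enum.each(fn {row, i} -> :ets.insert(tid, {i, row}) end)

    tid
  end

  def get_list(tid) do
    for {_i, row} <- :ets.tab2list(tid), do: row
  end
end
```

Two parallel extracts then simply yield two different table references, and each caller passes its own reference to `get_list/1`.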

In my use case I need to iterate over the Excel rows and create an Ecto.Changeset for each row. If all changesets are valid, I then insert them all into the database. If we could pass a function to get_list that takes the row values and returns the mapped value, we could build the final list in just one pass. This mapping could even be done during the SAX parsing, skipping the intermediate sheet representation altogether (maybe not that relevant, since get_list currently takes 300 µs for a sheet with 55k rows).
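The proposed mapping argument might look like this sketch (module and function names are hypothetical, and plain maps stand in for Ecto changesets):

```elixir
# Sketch of get_list taking a per-row mapping function, so the caller's
# transformation happens in the same pass that builds the final list.
defmodule RowMap do
  def get_list(rows, mapper \\ & &1) do
    Enum.map(rows, mapper)
  end
end
```

A caller could then turn each raw row into its domain shape directly, e.g. `RowMap.get_list(rows, fn [name, qty] -> %{name: name, qty: qty} end)`, rather than mapping over the result afterwards.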

Another interesting option to consider is to expose something like File.stream!. This would be quite flexible and allow different ways of processing the data (and cover the mapping use case described in the previous paragraph).
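A File.stream!-style API could be built on Stream.resource/3; in this sketch a plain list stands in for rows pulled from the SAX parser, and `SheetStream` is an illustrative name.

```elixir
# Hedged sketch: expose rows as a lazy Stream so callers can map, filter,
# or insert into a database without holding the whole sheet in memory.
defmodule SheetStream do
  def stream(rows) do
    Stream.resource(
      # start_fun: in a real implementation this would open the sheet
      fn -> rows end,
      # next_fun: emit one row at a time until the source is exhausted
      fn
        [] -> {:halt, []}
        [row | rest] -> {[row], rest}
      end,
      # after_fun: release any resources (no-op here)
      fn _state -> :ok end
    )
  end
end
```

This also covers the mapping use case above, since the caller can pipe the stream into `Stream.map/2` or `Enum.map/2` with their row-to-changeset function.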

@pma I did initially look at xmerl but had trouble understanding the documentation. I was able to find a great example of erlsom from a blog post by @benjamintanweihao, so that's why I went with erlsom. Now that I understand SAX parsing better, I'll take a look at xmerl again to see if I can get rid of that extra dependency.

Thanks for the comments and suggestions. I do plan to implement a better way of parsing multiple worksheets and like your ideas of mapping during the SAX parsing and providing some sort of streaming option, so I will look into those as well. I appreciate you taking the time to test it out and let me know your thoughts!

Initial performance improvement complete. Additional improvements will follow.