sajari / docconv

Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text

docx parsing time is slow compared to docx2txt tool

hwo411 opened this issue

Hello!

We recently stress-tested the library in our app and noticed that docx parsing performance is pretty poor compared to other tools on fairly large files.

Example docx file:
https://tolstoy.ru/upload/iblock/b22/voina-i-mir.docx

The tools we compared the library to:

  1. https://docx2txt.sourceforge.net/
  2. https://github.com/jgm/pandoc

On my laptop (Ryzen 5800H, 64 GB RAM), docconv parses this file in around 40 seconds.
Pandoc has similar performance.

But docx2txt parses it in under a second.

On our servers the difference is much bigger, since we're not running powerful machines yet.

Is there something that can be improved in the docx parsing to make it comparable to docx2txt? At first glance the output is similar, so docx2txt does not seem to be trading quality for speed.

I also want to mention that parsing a PDF file with the same content as this docx file takes less time (around 4 seconds), even though the PDF is larger (30 MB vs 4 MB).