.docx extractor options

Question

fwertz opened this issue 9 years ago

It looks like the options passed to other extractors are not used in the .docx extraction process. The textract APIs are passing an empty string back to the callback for large .docx files (testing with a .docx of around 400 pages).

David Bashford · Answer 1 · Tue Aug 11 2015 02:39:57 GMT+0800 (China Standard Time)

The options were removed with 1.0 as I moved away from utilizing unzip and now use an in-memory unzip. I'll take a look to see if there's something there I'm missing. It would help if you could point me at a docx that is large enough to cause the issue.

Francis Wertz · Answer 2 · Tue Aug 11 2015 02:45:30 GMT+0800 (China Standard Time)

Sure - I have a sample ipsum doc on Google. I haven't tried the URL access APIs; I've just downloaded the file directly and tried extracting text using the filesystem APIs.

You can grab it here:

https://docs.google.com/document/d/1mkTIvu9iyueHPerpMxnt7ldkZ0uCqpNz7wgWOhMNWoY/edit?usp=sharing

A quick note on the Google doc - I duplicated it and cut the contents down to 3 pages, and textract works on that copy. It's just the large document that's problematic.

David Bashford · Answer 3 · Tue Aug 11 2015 02:51:38 GMT+0800 (China Standard Time)

Thanks! Have to jump onto some stuff this afternoon so may not get to this until the evening. I did grab the doc though.

David Bashford · Answer 4 · Tue Aug 11 2015 02:54:46 GMT+0800 (China Standard Time)

FWIW I've duplicated your blank string locally.

Francis Wertz · Answer 5 · Tue Aug 11 2015 08:32:31 GMT+0800 (China Standard Time)

Thanks man, I'll check this out tonight. Digging the unit test.

David Bashford · Answer 6 · Tue Aug 11 2015 08:40:15 GMT+0800 (China Standard Time)

Just published as 1.0.3. This was a bit tricky to track down. Banged my head on streams for a while before I realized that the "all done extracting, here's the text!" callback was being called prematurely. The zipfile's end event fires shortly after you start processing the last entry in the zip file, not once you have finished processing it. You would never notice this if the file was small enough to stream through quickly, as the timing would happen to work out.

Thanks for pointing this out!
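
For readers who hit the same timing problem, here is a minimal sketch of one way to guard against it. It is not the actual 1.0.3 patch. `collectEntries`, the `'entry'` event shape, and the `read(cb)` function are all hypothetical names made up for illustration; only Node's `EventEmitter` is real. The idea is to call the final callback only when the source has ended and every entry that was started has also finished.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper, not textract's actual code: gathers text from every
// entry and calls `done` exactly once, only after the source has emitted
// 'end' AND every entry that was started has finished.
function collectEntries(source, done) {
  const chunks = [];
  let pending = 0;      // entries started but not yet finished
  let ended = false;    // has the source emitted 'end'?
  let finished = false; // guards against calling `done` twice

  function maybeDone() {
    if (ended && pending === 0 && !finished) {
      finished = true;
      done(null, chunks.join(''));
    }
  }

  // Assumed event shape: 'entry' carries a `read(cb)` function that
  // delivers the entry's text asynchronously.
  source.on('entry', (read) => {
    const index = chunks.length;
    chunks.push('');    // reserve a slot so output keeps entry order
    pending += 1;
    read((err, text) => {
      if (finished) return;
      if (err) {
        finished = true;
        done(err);
        return;
      }
      chunks[index] = text;
      pending -= 1;
      maybeDone();
    });
  });

  // 'end' alone is not enough: the last entry may still be in flight.
  source.on('end', () => {
    ended = true;
    maybeDone();
  });
}
```

With a small file the last entry usually finishes before `'end'` is handled, so calling `done` straight from the `'end'` handler appears to work. With a large file the last entry is still being read at that point, which would match the empty string reported above. Counting in-flight entries removes the dependence on timing.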