dbashford / textract

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

.docx extractor options

fwertz opened this issue

It looks like the options passed to other extractors are not used by the .docx extraction process. textract's APIs pass an empty string back to the callback for large .docx files (tested with a .docx of around 400 pages).

The options were removed in 1.0 when I moved away from the unzip utility and switched to an in-memory unzip. I'll take a look to see if there's something I'm missing. It would help if you could point me at a docx that is large enough to cause the issue.

Sure - I have a sample lorem ipsum doc on Google Docs. I haven't tried the URL access APIs; I just downloaded the file and tried extracting text using the filesystem APIs.

You can grab it here:

https://docs.google.com/document/d/1mkTIvu9iyueHPerpMxnt7ldkZ0uCqpNz7wgWOhMNWoY/edit?usp=sharing

A quick note on the Google doc: I duplicated it, cut the contents down to 3 pages, and textract works on that copy. Only the large document is problematic.

Thanks! Have to jump onto some stuff this afternoon so may not get to this until the evening. I did grab the doc though.

FWIW I've duplicated your blank string locally.

Thanks man, I'll check this out tonight. Digging the unit test.

Just published as 1.0.3. This was a bit tricky to track down. Banged my head on streams for a while before I realized that the "all done extracting, here's the text!" callback was being called prematurely. The zipfile's end event fires shortly after you start processing the last entry in the zip file, not once you have finished processing it. You would never notice this with a file small enough to stream through quickly, because the timing happens to work out.
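For readers following along, here is a minimal sketch of the kind of guard that avoids this race. It is not textract's actual fix or the zip library's API. `collectEntries`, `processEntry`, and the `{ text, delay }` entry shape are hypothetical names made up for illustration, and a plain Node `EventEmitter` stands in for the zip reader. The idea is to call the final callback only once the `end` event has fired and every started entry has finished.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper (not textract's code): gathers text from every entry and
// calls done() only after 'end' has fired AND all started entries have finished.
function collectEntries(source, processEntry, done) {
  const chunks = [];
  let pending = 0;      // entries started but not yet finished
  let ended = false;    // has the source emitted 'end'?
  let finished = false; // guard so done() is called at most once

  function maybeDone(err) {
    if (finished) return;
    if (err) {
      finished = true;
      done(err);
      return;
    }
    if (ended && pending === 0) {
      finished = true;
      done(null, chunks.join(''));
    }
  }

  source.on('entry', (entry) => {
    const index = chunks.length;
    chunks.push(''); // reserve a slot so output keeps entry order
    pending += 1;
    processEntry(entry, (err, text) => {
      pending -= 1;
      if (!err) chunks[index] = text;
      maybeDone(err);
    });
  });

  // 'end' alone is not enough: the last entry may still be in flight.
  source.on('end', () => {
    ended = true;
    maybeDone();
  });

  source.on('error', maybeDone);
}

// Demo with a stand-in emitter: 'end' fires while the slow entry is still running.
const source = new EventEmitter();
collectEntries(
  source,
  (entry, cb) => setTimeout(() => cb(null, entry.text), entry.delay),
  (err, text) => console.log(err || text)
);
source.emit('entry', { text: 'small ', delay: 0 });
source.emit('entry', { text: 'large', delay: 50 });
source.emit('end');
```

Calling `done` directly from the `end` handler would hand back whatever had been collected so far, which matches the empty string seen with the large document. With the `pending` counter, the callback waits for the slow entry regardless of how the timing falls.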

Thanks for pointing this out!