yomurb / yomu

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)

Home Page: http://github.com/yomurb/yomu

Thread crash when parsing some special file

sherllochen opened this issue

This file has a .doc extension, but it is actually an HTML file. When yomu processes it, it runs for a very long time without any response, until I forcibly kill the thread.
Even if I rename the file to *.html, the behavior is the same, so there may be something unusual about the file itself.
When I parse it with Tika directly, the text is extracted correctly.

fake_doc_but_htm.doc.zip
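
As a possible stopgap for the hang described above, the extraction call can be wrapped in a time limit so the calling thread is not blocked indefinitely. The following is only a minimal sketch using Ruby's standard `Timeout` module. The helper name `extract_with_timeout`, the 30-second limit, and the file name in the usage comment are assumptions for illustration and are not part of yomu. A timeout raised in Ruby may also leave the underlying Java process running, so this does not address the root cause.

```ruby
require 'timeout'

# Hypothetical helper (not part of yomu): runs the given block and returns
# its value, or nil if the block does not finish within `seconds`.
def extract_with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  nil
end

# Assumed usage with yomu (file name and limit are illustrative):
#   text = extract_with_timeout(30) { Yomu.new('fake_doc_but_htm.doc').text }
#   warn 'extraction timed out' if text.nil?
```

Returning nil keeps the call site simple, at the cost of not distinguishing a timeout from an extraction that legitimately returns nothing; re-raising the error would be the stricter choice.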

@sherllochen I don't believe this project is maintained anymore. I suggest trying the newer version of Tika (v1.14). I've forked this project and updated Tika; see https://github.com/abrom/henkei