yomurb / yomu

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)

Home Page: http://github.com/yomurb/yomu

Many PDF documents don't parse correctly

jasonperrone opened this issue

Not sure if other people have this problem, but half of the PDFs I throw at this thing return gobbledygook for text. The other half are fine. Incidentally, pdf-reader processes those same docs with no problem.

Also noticed that annoying quirk of Tika. One solution I can think of is to drop Tika in favor of pdftotext when parsing PDF files.

Exactly what I already did.
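
For anyone landing here, a rough sketch of that workaround might look like the following. It assumes the `pdftotext` binary (from Poppler or Xpdf) is on your PATH, and the helper names `extractor_for`, `pdftotext_command`, and `extract_text` are made up for illustration; they are not part of yomu. PDFs are routed to `pdftotext`, everything else still goes through Yomu.

```ruby
require 'open3'

# Hypothetical helper: pick an extractor based on the file extension.
def extractor_for(path)
  File.extname(path).downcase == '.pdf' ? :pdftotext : :yomu
end

# Build the argv for pdftotext. Passing "-" as the output file
# makes pdftotext write the extracted text to stdout.
def pdftotext_command(path)
  ['pdftotext', path, '-']
end

# Hypothetical wrapper: PDFs go to pdftotext, other formats to Yomu (Tika).
def extract_text(path)
  case extractor_for(path)
  when :pdftotext
    out, err, status = Open3.capture3(*pdftotext_command(path))
    raise "pdftotext failed: #{err}" unless status.success?
    out
  else
    require 'yomu' # loaded lazily so PDF-only use does not need the gem
    Yomu.new(path).text
  end
end
```

Calling `extract_text('report.pdf')` would then shell out to `pdftotext`, while `extract_text('notes.docx')` keeps using Yomu as before. Whether `pdftotext` handles your particular problem PDFs better is something you would have to check against your own files.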