Many PDF documents don't parse correctly

Question

jasonperrone opened this issue 9 years ago

Not sure if other people have this problem, but half of the PDFs I throw at this thing return gobbledygook for text. The other half are fine. Incidentally, pdf-reader processes those same docs with no problem.

Erol Fornoles · Answer 1 · Mon Nov 16 2015 14:43:02 GMT+0800 (China Standard Time)

I've also noticed that annoying quirk of Tika. One solution I can think of is to drop Tika in favor of pdftotext when parsing PDF files.
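
For anyone wanting to try that route, here is a rough sketch of shelling out to pdftotext. It assumes the `pdftotext` command-line tool (shipped with Poppler or Xpdf) is installed and on your PATH. The function names `pdftotext_command` and `extract_text` are made up for illustration and are not part of this library.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, binary="pdftotext"):
    """Build the argument list for pdftotext (hypothetical helper).

    The trailing "-" asks pdftotext to write the text to stdout
    instead of creating a .txt file next to the PDF.
    """
    return [binary, "-enc", "UTF-8", str(pdf_path), "-"]


def extract_text(pdf_path, binary="pdftotext"):
    """Return the text of a PDF by running the external pdftotext tool."""
    # Assumption: the tool is installed; fail early with a clear error if not.
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, binary),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("some.pdf")` on one of the PDFs that comes back garbled would show whether pdftotext copes with it any better. Results will probably still vary from document to document.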

Jason Perrone · Answer 2 · Mon Nov 16 2015 20:26:50 GMT+0800 (China Standard Time)

Exactly what I already did.