Many PDF documents don't parse correctly
jasonperrone opened this issue
Jason Perrone commented
Not sure if other people have this problem, but half of the PDFs I throw at this thing return gobbledygook for text. The other half are fine. Incidentally, pdf-reader processes those same docs with no problem.
Erol Fornoles commented
I've also noticed that annoying quirk of Tika. One solution I can think of is to drop Tika in favor of pdftotext when parsing PDF files.
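A rough sketch of that idea, in case it helps anyone: route PDFs to the pdftotext command-line tool and leave everything else on the existing parser. This is only an illustration, not how this library does it. It assumes pdftotext is installed and on the PATH, and the names `is_pdf`, `pdftotext_command`, `extract_text`, and the `fallback` callable (standing in for the Tika-based path) are all hypothetical.

```python
import shutil
import subprocess


def is_pdf(path):
    """Return True when the file starts with the PDF magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def pdftotext_command(path):
    """Build the pdftotext invocation; the trailing "-" sends text to stdout."""
    return ["pdftotext", str(path), "-"]


def extract_text(path, fallback):
    """Use pdftotext for PDFs and the caller-supplied `fallback` otherwise.

    `fallback` is a hypothetical callable (path -> str) representing
    whatever parser already handles the non-PDF formats.
    """
    if not is_pdf(path):
        return fallback(path)
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    # check=True raises CalledProcessError if pdftotext exits non-zero.
    result = subprocess.run(
        pdftotext_command(path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Checking the magic bytes rather than the file extension means a mislabeled PDF still goes through pdftotext. Calling `extract_text("report.pdf", fallback=some_parser)` would then bypass Tika only for PDFs.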
Jason Perrone commented
Exactly what I already did.