tylerdq / pdfca

Batch process text-containing PDF files for corpus and content analysis.

Some PDFs can't be read

tylerdq opened this issue

Certain PDFs can't be parsed by PyPDF2 (the library that pdfca relies on), and there seems to be no way to predict which ones will fail.

Refer to: https://stackoverflow.com/questions/30272269/python-text-extraction-does-not-work-on-some-pdfs

This may be a font issue that can be fixed, or it may require a different PDF reading library (if one exists).
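Since the failures can't be predicted, one workaround is to flag the affected files during a batch run. The following is a minimal sketch, not pdfca's actual code: `triage` and `pypdf2_extract` are hypothetical helper names, and the extractor assumes the PyPDF2 2.0+ naming (`PdfReader`, `extract_text`); older releases used `PdfFileReader` and `extractText` instead.

```python
from typing import Callable, Dict, Iterable


def pypdf2_extract(path: str) -> str:
    """Extract all page text with PyPDF2 (assumes PyPDF2 >= 2.0 naming)."""
    # Imported lazily so triage() below can be used with any extractor.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # Treat a page with no extractable text as an empty string.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def triage(
    paths: Iterable[str],
    extract: Callable[[str], str],
    min_chars: int = 1,
) -> Dict[str, str]:
    """Label each path 'ok', 'empty', or 'error: <ExceptionName>'."""
    report = {}
    for path in paths:
        try:
            text = extract(path)
        except Exception as exc:  # parsing errors vary by file and library
            report[path] = f"error: {type(exc).__name__}"
            continue
        # Whitespace-only output counts as unreadable.
        report[path] = "ok" if len(text.strip()) >= min_chars else "empty"
    return report
```

Calling `triage(pdf_paths, pypdf2_extract)` would return a dictionary mapping each file to its status, so files marked `empty` or `error: ...` can be set aside or retried with another library by passing a different `extract` function.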