tylerdq / pdfca

Batch process text-containing PDF files for corpus and content analysis.

Improve performance

tylerdq opened this issue

At least for .parquet output, pyarrow's built-in threading and compression options offer a chance to speed up reading and writing the dataframe binaries and to reduce their size on disk.
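As a rough illustration of the options in question, here is a minimal sketch that calls pyarrow directly. The function names `roundtrip` and `compare_codecs` and the column layout are hypothetical and not taken from pdfca; `compression` and `use_threads` are the pyarrow parameters being discussed.

```python
import os


def roundtrip(columns, path, compression="snappy"):
    """Write a dict of columns to a Parquet file and read it back."""
    # Imported lazily so this sketch can be loaded without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table(columns)
    # "snappy" is pyarrow's default codec; "gzip" and "zstd" usually trade
    # some write speed for smaller files.
    pq.write_table(table, path, compression=compression)
    # use_threads=True is already the default; it is spelled out here only
    # to show where the threading switch lives.
    return pq.read_table(path, use_threads=True)


def compare_codecs(columns, directory, codecs=("snappy", "zstd")):
    """Return {codec: file size in bytes} for the same data (hypothetical helper)."""
    sizes = {}
    for codec in codecs:
        path = os.path.join(directory, f"corpus_{codec}.parquet")
        roundtrip(columns, path, compression=codec)
        sizes[codec] = os.path.getsize(path)
    return sizes
```

Running `compare_codecs` on a representative batch of extracted text would show whether a non-default codec saves enough disk space to justify the change.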

There may also be opportunities to multi-thread the PDF extraction itself (either with PyPDF2 or by switching to an alternate library).
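One way this could look, sketched with the standard library's `concurrent.futures`. Note that it defaults to worker processes rather than threads: PyPDF2 is pure Python, so CPU-bound extraction in threads would mostly contend for the interpreter lock. The names `extract_text` and `extract_all` are hypothetical, and the PyPDF2 calls assume version 2.x or later (`PdfReader`, `extract_text`).

```python
from concurrent.futures import ProcessPoolExecutor


def extract_text(path):
    """Return the concatenated text of every page in one PDF."""
    # Assumes PyPDF2 2.x or later; imported lazily so the executor logic
    # below can be exercised without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # Treat a page with no extractable text as an empty string.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_all(paths, extract=extract_text,
                executor_cls=ProcessPoolExecutor, max_workers=None):
    """Apply `extract` to each path in parallel and return {path: text}."""
    paths = list(paths)
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so results line up with paths.
        texts = list(pool.map(extract, paths))
    return dict(zip(paths, texts))
```

Calling `extract_all(list_of_pdf_paths)` from a script's `if __name__ == "__main__":` block would spread the files across worker processes. Passing `executor_cls=ThreadPoolExecutor` instead would make it easy to time threads against processes on a real batch.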

The Parquet improvements might be unnecessary. It appears from pyarrow's documentation that the default behavior is already fairly well optimized: Snappy compression is applied on write and reads are multi-threaded unless disabled.