Improve performance

Question

tylerdq opened this issue 5 years ago

At least for .parquet output, pyarrow's built-in threading and compression options could speed up reads and writes and reduce the disk usage of the saved dataframe binaries.

There may also be opportunities to multi-thread the PDF extraction itself (either with PyPDF2 or by switching to an alternative library).
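A minimal sketch of what concurrent extraction might look like, using only the standard library. `extract_text` here is a hypothetical placeholder that just returns a string; in practice it would wrap whatever per-file PDF reading the project does. The file names are also made up.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text(path):
    """Hypothetical stand-in for per-file extraction (e.g. a PyPDF2-based reader)."""
    return f"text from {path}"


def extract_all(paths, extract_one=extract_text, max_workers=4):
    """Run `extract_one` over `paths` concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(extract_one, paths))


results = extract_all(["a.pdf", "b.pdf", "c.pdf"])
print(results)
```

Since PyPDF2 is pure Python, threads may be limited by the GIL for CPU-bound parsing. `concurrent.futures.ProcessPoolExecutor` exposes the same `map` interface, so it may be worth benchmarking both before committing to either.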

Tyler Quiring · Answer 1 · Wed Aug 14 2019 07:04:47 GMT+0800 (China Standard Time)

The Parquet improvements might be unnecessary. Judging from pyarrow's documentation, the default behavior is already fairly well optimized: `write_table` compresses with Snappy by default, and `read_table` uses multiple threads by default.