Improve performance
tylerdq opened this issue
Tyler Quiring commented
At least for .parquet output, pyarrow's built-in threading and compression options could speed up reading and writing the dataframe binaries and reduce their size on disk.
There may also be opportunities to parallelize the PDF extraction itself (using PyPDF2 or switching to an alternative library).
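A rough sketch of what parallel extraction could look like, in case it helps the discussion. It is not this project's code: `extract_text` and `extract_many` are hypothetical helpers, and it assumes PyPDF2 2.x or later (`PdfReader`, `page.extract_text()`). Since PyPDF2 is pure Python, threads are likely to be limited by the GIL, so the sketch defaults to a process pool and leaves the executor class swappable.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def extract_text(pdf_path):
    """Return the text of one PDF (hypothetical helper, not from this project)."""
    # Imported lazily so that each worker loads PyPDF2 itself.
    # Assumes PyPDF2 >= 2.0, which provides PdfReader and page.extract_text().
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can return an empty result for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_many(pdf_paths, extractor=extract_text,
                 executor_cls=ProcessPoolExecutor, max_workers=None):
    """Apply `extractor` to every path in parallel and return {path: result}.

    Pass executor_cls=ThreadPoolExecutor to use threads instead of processes.
    With a process pool, `extractor` must be picklable (a module-level function).
    """
    paths = [str(p) for p in pdf_paths]
    with executor_cls(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = list(pool.map(extractor, paths))
    return dict(zip(paths, results))
```

Usage would be something like `texts = extract_many(Path("pdfs").glob("*.pdf"))`, called under an `if __name__ == "__main__":` guard when the process pool is used. Whether this actually helps would need to be measured on real files.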
Tyler Quiring commented
The Parquet improvements might be unnecessary. It appears from pyarrow's documentation that the default behavior is already fairly optimized.