metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.

Home Page: http://www.metachris.com/pdfx

timeout option

DanielRuf opened this issue

Hi,

pdfx is very helpful for us when analyzing PDFs. Thanks for creating it.

But we have a small problem. When a PDF file contains a lot of text, pdfx (Python) fails only once the "too many recursions" error is thrown.

It would be helpful to have a max-timeout option so that pdfx does not keep trying to parse a file for 45 minutes or more (as happens in our case).

And another small question: what is the best way to scan / check many files at once? So far we run single pdfx commands from a bash script and wait until each command has finished. Backgrounding the commands with & causes problems with the OS job scheduler, and the whole OS freezes.
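Until such an option exists, a timeout can be enforced from outside the process. The following is only a sketch, not part of pdfx: `run_with_timeout` is a hypothetical helper name, the 300-second limit is an arbitrary assumption, and it assumes the `pdfx` command is on the PATH and is invoked as `pdfx <file>`.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return (returncode, stdout), or (None, "") if it timed out.

    Hypothetical helper, not part of pdfx.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return None, ""
    return proc.returncode, proc.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed invocation: pdfx <file>; 300 s is an arbitrary limit.
    code, out = run_with_timeout(["pdfx", sys.argv[1]], timeout_s=300)
    if code is None:
        print("pdfx timed out on", sys.argv[1], file=sys.stderr)
    else:
        print(out)
```

Because the limit is applied to the child process, it works regardless of where inside pdfx the time is being spent.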
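One possible approach, sketched here under assumptions rather than as a recommended pdfx workflow, is to drive the commands from a small Python script with a bounded worker pool, so that only a few pdfx processes run at any time instead of one per file. `run_one` and `run_batch` are hypothetical names; the worker count of 4 and the 300-second per-file limit are arbitrary, and `pdfx <file>` is assumed to be the invocation.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(cmd, timeout_s):
    """Return the command's exit code, or None if it exceeded timeout_s."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.returncode


def run_batch(commands, max_workers=4, timeout_s=300):
    """Run commands with at most max_workers at once.

    Results are returned in the same order as the input commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: run_one(c, timeout_s), commands))


if __name__ == "__main__" and len(sys.argv) > 1:
    paths = sys.argv[1:]
    # Assumed invocation: pdfx <file>
    results = run_batch([["pdfx", p] for p in paths])
    for path, code in zip(paths, results):
        status = "timeout" if code is None else f"exit {code}"
        print(f"{path}: {status}")
```

Threads are sufficient here because each worker just waits on an external process; the pool size caps how many pdfx processes compete for CPU and memory.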

Could you post the full stack trace, and perhaps an example PDF? Please reopen the issue with those, thanks 🙏

> Please reopen the issue with those, thanks

Only you can reopen the issue ;-)

Here is an example file:

54013162437.pdf

It produces this stack trace:

pdfx.log