pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

Home Page: https://pd3f.com

Timeout or similar on some weird PDFs

turian opened this issue

I've queued about 2000 PDFs. However, it seems to stop and wedge halfway through.

It's hard to work out which PDF is responsible, because that means a painful divide and conquer over the whole queue. (Is there a log I can look at, so I can easily reproduce it for you?)

A potential workaround would be a simple timeout. Jobs that time out can be removed from the queue, and the user can try them again later with a longer timeout. (This would also help identify problematic PDFs for future debugging.)
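To make the timeout idea concrete, here is a rough sketch of what I mean. It is not pd3f's actual code or API: `run_queue` and the `(name, command)` job shape are hypothetical, and it assumes each PDF can be processed by a separate command that can be killed safely.

```python
import subprocess
import sys


def run_queue(jobs, timeout_s):
    """Run each (name, command) job in turn with a per-job timeout.

    Returns three lists of job names: finished, failed, timed_out.
    NOTE: hypothetical helper for illustration, not part of pd3f.
    """
    finished, failed, timed_out = [], [], []
    for name, cmd in jobs:
        try:
            # subprocess.run kills the child process when the timeout
            # expires and then raises TimeoutExpired.
            result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
        except subprocess.TimeoutExpired:
            # Drop the job from the queue and remember it for a later retry.
            timed_out.append(name)
            continue
        if result.returncode == 0:
            finished.append(name)
        else:
            failed.append(name)
    return finished, failed, timed_out
```

The `timed_out` list doubles as the list of suspicious PDFs: those can be retried with a larger `timeout_s` or attached to a bug report, while the rest of the queue keeps moving.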