pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

Home Page: https://pd3f.com


Configuring where queue and temporary results are stored

sliedes opened this issue

Hi,

I don't know whether this is currently possible. Maybe it only needs a simple Docker change that I haven't figured out yet, or a different temporary directory location in some script. If so, it should perhaps be better documented.

I tried to extract text from ~20k PDFs over a weekend, but only got through 489 before the machine, which has 32 GiB of RAM, ran out of memory. Some of the Docker containers had a lot of data under /tmp, which I think was a tmpfs (and so backed by memory).
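
One way to confirm the tmpfs suspicion would be to run something like the following inside a container. This is only a sketch: `mount_for` is a hypothetical helper name, not part of pd3f, and it assumes a Linux container with /proc/mounts and a Python interpreter available. It does not handle mount points containing escaped spaces.

```python
import os
import shutil


def mount_for(path, mounts_text):
    """Return (mount_point, fs_type) of the mount containing `path`.

    `mounts_text` is text in /proc/mounts format:
    device, mount point, filesystem type, options, ...
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties, the later entry wins
            # because a later mount shadows an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        found = mount_for("/tmp", f.read())
    usage = shutil.disk_usage("/tmp")
    print("mount for /tmp:", found)
    print("used under that mount: %.1f MiB" % (usage.used / 2**20))
```

If the reported filesystem type is `tmpfs`, everything written under /tmp counts against memory rather than disk, which would match the behaviour described above.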