There are 2 repositories under the pd3f topic.
PDF text extraction pipeline: self-hosted, local-first, Docker-based
Dehyphenation of broken text (mainly German), e.g., text extracted from a PDF
Python package to reconstruct the original continuous text from PDFs with language models
Flair's language models without unnecessary dependencies
Dataset of (mostly German) PDFs used to develop pd3f
Results with pd3f on some PDF datasets
Website to advertise & document pd3f