Easily handle PDFs, extract readable text, recognize image text with OCR and clean up formatting to make it more suitable for building knowledge bases.
Added a documentation tutorial on integrating with graphrag.
Doc2X is a new universal document OCR tool that converts images or PDF files into Markdown/LaTeX text while preserving formulas and text formatting. It performs better than similar tools in most scenarios. pdfdeal provides wrapper classes for sending requests to Doc2X.
Use various OCR or PDF recognition tools to recognize the text in images and add it to the original text. You can set the output format to PDF, which ensures that the recognized text keeps the same page numbers as the original in the new PDF. pdfdeal also offers various practical file processing tools.
Processed PDFs achieve better recognition rates when used with knowledge base applications such as graphrag, Dify, and FastGPT.
It is recommended to use Doc2X for the best results.
For example, if graphrag does not support recognizing PDFs, you can use doc2x to convert them into txt documents first.
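One way to prepare such txt documents is sketched below. It assumes the PDFs have already been converted to Markdown files in some folder, and that the knowledge base tool reads plain .txt files from an input folder. The helper name markdown_to_txt_inputs and both directory arguments are hypothetical and not part of pdfdeal; only the Python standard library is used.

```python
import shutil
from pathlib import Path


def markdown_to_txt_inputs(md_dir: str, input_dir: str) -> list[Path]:
    """Copy every .md file under md_dir into input_dir with a .txt suffix.

    Hypothetical helper: the folder layout is an assumption, not pdfdeal's API.
    """
    src = Path(md_dir)
    dst = Path(input_dir)
    dst.mkdir(parents=True, exist_ok=True)

    written = []
    for md_file in sorted(src.rglob("*.md")):
        # Markdown is already plain text, so a copy with a new suffix is enough.
        target = dst / (md_file.stem + ".txt")
        shutil.copyfile(md_file, target)
        written.append(target)
    return written
```

Files that share a stem in different subfolders would overwrite each other in this sketch, so keep the converted files in a flat folder or extend the naming scheme.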
For knowledge base applications, you can also use pdfdeal to enhance documents. Below is a comparison of the original PDF, OCR enhancement, and Doc2X processing in Dify:
You can view new features under development here!
For details, please refer to the documentation, or check out the documentation repository pdfdeal-docs.
Install from PyPI:
pip install --upgrade pdfdeal
When using "pytesseract", make sure that tesseract is installed first:
pip install 'pdfdeal[pytesseract]'
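To confirm that the tesseract binary is actually reachable before running OCR, a small check like the following can help. This is a minimal sketch using only the standard library; the function name tesseract_available is hypothetical and not part of pdfdeal or pytesseract, and it assumes the executable is named tesseract and is expected on PATH.

```python
import shutil


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable can be found on PATH."""
    return shutil.which("tesseract") is not None


if not tesseract_available():
    # The pip extra installs the Python wrapper only; the OCR engine itself
    # has to be installed separately through the system package manager.
    print('tesseract was not found on PATH; install it before using ocr="pytesseract"')
```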
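from pdfdeal import deal_pdf, get_files

# Collect the PDF files under tests/pdf, with output names using the md extension
files, rename = get_files("tests/pdf", "pdf", "md")

# Run OCR with pytesseract and write Markdown files to the Output folder
output_path, failed, flag = deal_pdf(
    pdf_file=files,
    output_format="md",
    ocr="pytesseract",
    language=["eng"],
    output_path="Output",
    output_names=rename,
)

for f in output_path:
    print(f"Save processed file to {f}")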
from pdfdeal import Doc2X, get_files

client = Doc2X()

# Collect the PDF files under tests/pdf, with output names using the pdf extension
file_list, rename = get_files(path="tests/pdf", mode="pdf", out="pdf")

# Process the files with Doc2X and write the results to the output folder
success, failed, flag = client.pdfdeal(
    pdf_file=file_list,
    output_path="./Output/test/multiple/pdfdeal",
    output_names=rename,
)

print(success)
print(failed)
print(flag)