williamdeve/RapidOCRPDF

RapidOCRPDF

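Built on the RapidOCR repository, this tool quickly extracts text from PDFs, including scanned and encrypted PDFs.
If the text in your PDF can be copied directly, use pdf2docx instead; there is no need to reinvent the wheel.
For scanned PDFs, layout restoration is not supported yet. It may be added later when time permits, with no fixed date.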

Usage

Install the rapidocr_pdf library

# Based on rapidocr_onnxruntime
pip install rapidocr_pdf[onnxruntime]

# Based on rapidocr_openvino
pip install rapidocr_pdf[openvino]

Run it

From a script:

from rapidocr_pdf import PDFExtracter

pdf_extracter = PDFExtracter()

pdf_path = 'tests/test_files/direct_and_image.pdf'
texts = pdf_extracter(pdf_path)
print(texts)

From the command line:

$ rapidocr_pdf -h
usage: rapidocr_pdf [-h] [-path FILE_PATH]

options:
-h, --help            show this help message and exit
-path FILE_PATH, --file_path FILE_PATH
                        File path, PDF or images

$ rapidocr_pdf -path tests/test_files/direct_and_image.pdf
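
The help text above maps onto a single optional argument. As a rough sketch only (this is not taken from the rapidocr_pdf source; `build_parser` is a hypothetical helper name), an `argparse` parser that accepts the same `-path` / `--file_path` option could look like this:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: mirrors the option shown in the help output above,
    # not the actual rapidocr_pdf implementation.
    parser = argparse.ArgumentParser(prog="rapidocr_pdf")
    parser.add_argument(
        "-path",
        "--file_path",
        type=str,
        default=None,
        help="File path, PDF or images",
    )
    return parser


# argparse derives the attribute name from the first "--" option: file_path.
args = build_parser().parse_args(["-path", "tests/test_files/direct_and_image.pdf"])
print(args.file_path)
```

Both spellings store the value in `args.file_path`, so downstream code only has to read one attribute.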

Input and output

Input: Union[str, Path, bytes]

Output: a list of [page number, text, confidence] entries, as in the following example:
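
Accepting `Union[str, Path, bytes]` means a caller can pass either a file location or the raw PDF content. The sketch below shows one way such an input could be normalized to bytes before processing. It is an illustration under that assumption, not the library's own code, and `read_pdf_bytes` is a hypothetical name.

```python
from pathlib import Path
from typing import Union


def read_pdf_bytes(content: Union[str, Path, bytes]) -> bytes:
    """Return raw PDF bytes for a path-like or bytes input (hypothetical helper)."""
    if isinstance(content, bytes):
        # Already raw content, e.g. read from a network response.
        return content
    if isinstance(content, (str, Path)):
        path = Path(content)
        if not path.is_file():
            raise FileNotFoundError(f"{path} does not exist")
        return path.read_bytes()
    raise TypeError(f"Unsupported input type: {type(content).__name__}")
```

With a helper like this, `read_pdf_bytes("doc.pdf")`, `read_pdf_bytes(Path("doc.pdf"))`, and `read_pdf_bytes(raw_bytes)` all yield the same kind of value.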

[
    ['0', '人之初，性本善。性相近，习相远。', '0.8969868'],
    ['1', 'Men at their birth, are naturally good.', '0.8969868'],
]
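
Since every entry in the example above carries its page number and confidence as strings, a small post-processing step is often handy. The following sketch assumes the output has exactly the shape shown. `group_by_page` and `min_confidence` are hypothetical names introduced for illustration and are not part of rapidocr_pdf.

```python
from collections import defaultdict


def group_by_page(results, min_confidence=0.0):
    """Join the text of each page, skipping entries below min_confidence."""
    pages = defaultdict(list)
    for page, text, score in results:
        # Page numbers and scores arrive as strings in the example output.
        if float(score) >= min_confidence:
            pages[int(page)].append(text)
    return {page: "\n".join(lines) for page, lines in sorted(pages.items())}


sample = [
    ['0', '人之初，性本善。性相近，习相远。', '0.8969868'],
    ['1', 'Men at their birth, are naturally good.', '0.8969868'],
]
pages = group_by_page(sample, min_confidence=0.5)
print(pages[1])
```

Raising `min_confidence` filters out low-confidence entries, at the cost of possibly dropping whole pages.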

Changelog

2023-08-28 v0.0.6 update:
- Fixed the PyMuPDF version dependency problem (issue #2)
2023-04-17 v0.0.2 update:
- Improved the usage documentation

About

Based on RapidOCR, extract the PDF content.

Apache License 2.0
