import os
from glob import glob

from pdf2docx import Converter

# Path to the folder that contains the PDF files
folder_path = r'C:\Users\username\Desktop\folder'

# Use glob to find every PDF file in the folder
pdf_files = glob(os.path.join(folder_path, '*.pdf'))

# Loop over every PDF file found
for pdf_file in pdf_files:
    # Build the DOCX path from the PDF path by swapping the extension
    docx_file = os.path.splitext(pdf_file)[0] + '.docx'

    # Create a Converter object and run the conversion
    cv = Converter(pdf_file)
    try:
        cv.convert(docx_file)  # convert all pages
    finally:
        cv.close()  # release the PDF even if the conversion fails

    print(f'Converted: {pdf_file} to {docx_file}')
- Extract data from PDF with PyMuPDF, e.g. text, images and drawings
- Parse layout with rules, e.g. sections, paragraphs, images and tables
- Generate docx with python-docx
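
The three steps above can be sketched roughly as follows. This is an illustration only, not pdf2docx's actual implementation: the helper names (`extract_blocks`, `order_paragraphs`, `write_docx`) are hypothetical, and the single "rule" used here (sort blocks top to bottom, then left to right) is a stand-in for the much richer layout rules the library applies.

```python
from typing import List, Tuple

# One text block as (x0, y0, x1, y1, text); this shape is an assumption of the sketch.
Block = Tuple[float, float, float, float, str]


def extract_blocks(pdf_path: str, page_index: int = 0) -> List[Block]:
    """Step 1: pull raw text blocks from one page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the pure-Python step runs without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type);
        # block_type 0 marks a text block.
        return [
            (b[0], b[1], b[2], b[3], b[4])
            for b in page.get_text("blocks")
            if b[6] == 0
        ]


def order_paragraphs(blocks: List[Block]) -> List[str]:
    """Step 2: a toy layout rule that orders blocks and normalizes whitespace."""
    ordered = sorted(blocks, key=lambda b: (round(b[1]), b[0]))
    paragraphs = [" ".join(b[4].split()) for b in ordered]
    return [p for p in paragraphs if p]  # drop blocks that were only whitespace


def write_docx(paragraphs: List[str], docx_path: str) -> None:
    """Step 3: write one docx paragraph per parsed paragraph with python-docx."""
    from docx import Document  # python-docx, imported lazily

    document = Document()
    for text in paragraphs:
        document.add_paragraph(text)
    document.save(docx_path)
```

Chaining the helpers as `write_docx(order_paragraphs(extract_blocks("sample.pdf")), "sample.docx")` would produce a plain-text docx for the first page; `sample.pdf` is a placeholder file name.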
- Parse and re-create page layout
  - page margin
  - section and column (1 or 2 columns only)
  - page header and footer [TODO]
- Parse and re-create paragraph
  - OCR text [TODO]
  - text in horizontal/vertical direction: from left to right, from bottom to top
  - font style, e.g. font name, size, weight, italic and color
  - text format, e.g. highlight, underline, strike-through
  - list style [TODO]
  - external hyper link
  - paragraph horizontal alignment (left/right/center/justify) and vertical spacing
- Parse and re-create image
  - in-line image
  - image in Gray/RGB/CMYK mode
  - transparent image
  - floating image, i.e. picture behind text
- Parse and re-create table
  - border style, e.g. width, color
  - shading style, i.e. background color
  - merged cells
  - vertical direction cell
  - table with partly hidden borders
  - nested tables
- Parsing pages with multi-processing
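
As a minimal sketch of the multi-processing feature: to the best of my understanding, `Converter.convert` accepts `multi_processing` and `cpu_count` keyword settings, but check the pdf2docx documentation for your installed version. The helpers `pick_cpu_count` and `convert_parallel` are hypothetical names introduced only for this example.

```python
import os
from typing import Optional


def pick_cpu_count(requested: Optional[int] = None) -> int:
    """Clamp a requested worker count to the CPUs available (hypothetical helper)."""
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


def convert_parallel(pdf_path: str, docx_path: str,
                     workers: Optional[int] = None) -> None:
    """Convert all pages, asking pdf2docx to parse them in several processes."""
    from pdf2docx import Converter  # imported lazily; requires pdf2docx

    cv = Converter(pdf_path)
    try:
        # Assumption: multi_processing / cpu_count are the setting names
        # understood by the installed pdf2docx version.
        cv.convert(docx_path, multi_processing=True,
                   cpu_count=pick_cpu_count(workers))
    finally:
        cv.close()
```

A call such as `convert_parallel("sample.pdf", "sample.docx", workers=4)` uses placeholder file names. On platforms that spawn worker processes (Windows, for instance), the call likely needs to sit under an `if __name__ == "__main__":` guard.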
pdf2docx can also be used as a tool to extract table contents, since both table content and format/style are parsed.
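
A hedged sketch of table extraction follows. It assumes `Converter.extract_tables` returns a list of tables, each a list of rows of cell strings, with `None` standing in for cells absorbed by a merge; verify that shape against your installed version. `table_to_csv` and `extract_tables_as_csv` are hypothetical helpers, not part of pdf2docx.

```python
import csv
import io
from typing import List, Optional

# Assumed shape of one extracted table: rows of cell strings, None for merged cells.
Table = List[List[Optional[str]]]


def table_to_csv(table: Table) -> str:
    """Serialize one extracted table to CSV text, writing None cells as empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def extract_tables_as_csv(pdf_path: str, start: int = 0,
                          end: Optional[int] = None) -> List[str]:
    """Extract tables from pages [start, end) and return one CSV string per table."""
    from pdf2docx import Converter  # imported lazily; requires pdf2docx

    cv = Converter(pdf_path)
    try:
        tables = cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
    return [table_to_csv(table) for table in tables]
```

Only the cell text is kept here; border and shading styles that the parser also recovers are discarded by this CSV conversion.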
- Text-based PDF files only
- Left-to-right languages
- Normal reading direction, no word transformation / rotation
- The rule-based method cannot reproduce every PDF layout exactly
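
Since scanned, image-only PDFs fall outside these limits, it can help to check for a text layer before converting. The sketch below is one possible heuristic, not something pdf2docx provides: the function names and the thresholds (`min_chars`, `min_ratio`) are assumptions chosen for illustration, and the character counts come from PyMuPDF's `page.get_text()`.

```python
from typing import List


def looks_text_based(chars_per_page: List[int], min_chars: int = 20,
                     min_ratio: float = 0.5) -> bool:
    """Guess whether a PDF has a usable text layer (hypothetical heuristic).

    Returns True when at least min_ratio of the pages carry min_chars or more
    extractable characters.
    """
    if not chars_per_page:
        return False
    pages_with_text = sum(1 for count in chars_per_page if count >= min_chars)
    return pages_with_text / len(chars_per_page) >= min_ratio


def count_chars_per_page(pdf_path: str) -> List[int]:
    """Count extractable characters on every page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]
```

For example, `looks_text_based(count_chars_per_page("sample.pdf"))` returning `False` suggests a scanned document that would need OCR first; `sample.pdf` is a placeholder.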