ChatGPT fails to parse PDF

Question

Hel5inki opened this issue a year ago

When I upload the generated PDF file, ChatGPT fails to parse it and responds with the following:

"It appears that the text extraction from the PDF didn't yield any readable content. This could be due to various reasons, such as the text being embedded as images rather than selectable text, or the PDF having some form of encryption or complex formatting that interferes with text extraction."

ChatGPT provided the following code snippet which it likely uses to parse the PDF:

from PyPDF2 import PdfFileReader
import os

# Define the path to the uploaded PDF file
pdf_path = '/mnt/data/resume.pdf'

# Initialize a PDF file reader object
pdf_reader = PdfFileReader(open(pdf_path, 'rb'))

# Initialize a variable to hold the extracted text
extracted_text = ''

# Loop through each page in the PDF file and extract the text
for page_num in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(page_num)
    extracted_text += page.extractText()

# Show the first 500 characters of the extracted text to give a sense of its contents
extracted_text[:500]

ChatGPT also provided this stderr statement:

UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]

Theofanis Despoudis · Answer 1 · Fri Sep 22 2023 20:55:29 GMT+0800 (China Standard Time)

ChatGPT gave you a code snippet, but did you test it yourself to see whether it works?
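One way to check this locally is a small script along the following lines. It is only a sketch, and it rests on two assumptions: that the maintained `pypdf` package is installed, and that the renamed `extract_text` method mentioned in the warning above is used instead of the deprecated `extractText`. The `has_readable_text` helper and its `min_chars` threshold are hypothetical, added purely for illustration.

```python
def extract_pdf_text(pdf_path):
    """Return the concatenated text of every page in the PDF."""
    # Assumption: the maintained `pypdf` package is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty string for pages without a text layer.
    return "".join(page.extract_text() or "" for page in reader.pages)


def has_readable_text(text, min_chars=20):
    """Hypothetical heuristic: the extraction counts as usable if it holds
    at least `min_chars` non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


if __name__ == "__main__":
    import sys

    # Usage: python check_pdf.py resume.pdf
    text = extract_pdf_text(sys.argv[1])
    print(text[:500])
    print("readable:", has_readable_text(text))
```

If this prints little or nothing for the generated PDF, the problem probably lies in how the text is embedded in the file rather than in ChatGPT itself.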