PDF file upload not correctly parsed

Question

plopezamaya opened this issue 4 months ago · comments

While using v3.0.79, it seems that some PDFs are not parsed correctly by pypdf's PdfReader (from pypdf import PdfReader) in backend/danswer/file_processing/extract_file_text.py.

As a result, the LLM answers that it cannot retrieve any information from the given document. Should an OCR reader or another framework be used for this?

arrfan · Answer 1 · Wed Jun 12 2024 17:46:57 GMT+0800 (China Standard Time)

If the PDFs contain scanned pages or are otherwise image-based, you might need OCR to extract the text.
pytesseract, together with an image processing library like Pillow, should work for extracting text from images within PDFs.
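A minimal sketch of that suggestion, assuming pypdf, Pillow, pytesseract, and the Tesseract binary are installed. The function names (tesseract_ocr, ocr_embedded_images, pdf_image_blobs) are hypothetical and not part of danswer; the OCR step is passed in as a callable so it can be swapped or stubbed.

```python
import io
from typing import Callable, Iterable, Iterator


def tesseract_ocr(image_bytes: bytes) -> str:
    """OCR a single image with Pillow + pytesseract."""
    # Imported lazily so this module still loads when the OCR
    # dependencies are not installed.
    from PIL import Image
    import pytesseract

    with Image.open(io.BytesIO(image_bytes)) as image:
        return pytesseract.image_to_string(image)


def ocr_embedded_images(
    image_blobs: Iterable[bytes],
    ocr: Callable[[bytes], str] = tesseract_ocr,
) -> str:
    """Run OCR over each image and join the non-empty results."""
    texts = (ocr(blob).strip() for blob in image_blobs)
    return "\n".join(text for text in texts if text)


def pdf_image_blobs(pdf_path: str) -> Iterator[bytes]:
    """Yield the raw bytes of every image embedded in the PDF's pages."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for page in reader.pages:
        for image in page.images:
            yield image.data
```

Usage would look like ocr_embedded_images(pdf_image_blobs("scan.pdf")), where "scan.pdf" is a placeholder path. OCR quality depends heavily on scan resolution and the installed Tesseract language data.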

plopezamaya · Answer 2 · Wed Jun 12 2024 23:36:11 GMT+0800 (China Standard Time)

@arrfandannge Do you mean that what is implemented in danswer today should be replaced by OCR or Pillow?

arrfan · Answer 3 · Thu Jun 13 2024 01:14:26 GMT+0800 (China Standard Time)

> @arrfandannge Do you mean that what is implemented in danswer today should be replaced by OCR or Pillow?

I suggest keeping the current implementation intact and adding OCR as a fallback for documents where pypdf extracts little or no text. This keeps the solution robust without giving up efficiency, because extraction with pypdf is generally faster than OCR.

I can try implementing this solution in a separate branch and see how it works.
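
The fallback idea above could be sketched roughly as follows. This is an illustration under assumptions, not the code from danswer or from any branch: the names pypdf_text, ocr_text, and extract_with_fallback are hypothetical, and MIN_TEXT_CHARS is an arbitrary threshold for deciding that pypdf found no usable text layer.

```python
from typing import Callable

# Hypothetical threshold: below this many characters, treat the
# pypdf result as a failed parse and try OCR instead.
MIN_TEXT_CHARS = 20


def pypdf_text(pdf_path: str) -> str:
    """Fast path: read the PDF's text layer with pypdf."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def ocr_text(pdf_path: str) -> str:
    """Slow path: OCR every embedded image with Pillow + pytesseract."""
    import io

    import pytesseract
    from PIL import Image
    from pypdf import PdfReader

    chunks = []
    for page in PdfReader(pdf_path).pages:
        for image in page.images:
            with Image.open(io.BytesIO(image.data)) as pil_image:
                chunks.append(pytesseract.image_to_string(pil_image))
    return "\n".join(chunks)


def extract_with_fallback(
    pdf_path: str,
    primary: Callable[[str], str] = pypdf_text,
    fallback: Callable[[str], str] = ocr_text,
    min_chars: int = MIN_TEXT_CHARS,
) -> str:
    """Use the fast extractor first; run OCR only if it finds too little text."""
    text = primary(pdf_path)
    if len(text.strip()) >= min_chars:
        return text
    ocr_result = fallback(pdf_path)
    # Keep whichever result carries more text.
    if len(ocr_result.strip()) > len(text.strip()):
        return ocr_result
    return text
```

Because OCR runs only when the fast path comes back nearly empty, text-based PDFs keep their current speed and only scanned documents pay the OCR cost.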