PDF file upload not correctly parsed
plopezamaya opened this issue
While using v3.0.79, it seems that some PDFs are currently not parsed well by `from pypdf import PdfReader` in `backend/danswer/file_processing/extract_file_text.py`.
The result is that the LLM answers that it cannot retrieve any information from the given document. Should an OCR reader or another framework be used for this?
You might need to use OCR to extract the text if the PDFs contain scanned images or are image-based.
pytesseract, along with an image processing library like Pillow, should work to extract text from images within PDFs.
@arrfandannge Does that mean what is implemented today in danswer should be replaced by OCR or Pillow?
I suggest keeping the current implementation intact but adding code that uses OCR as a fallback when pypdf returns little or no text. This keeps the solution robust and efficient, because extraction with pypdf is generally faster than OCR.
I can try implementing this solution in a separate branch and see how it works.
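For reference, a rough sketch of what such a fallback could look like. This is not the code in `extract_file_text.py`; the names `extract_with_fallback`, `read_pdf`, and the `min_chars` threshold are made up for illustration. It assumes pypdf, Pillow, and pytesseract are installed (plus the Tesseract binary), and that the scanned pages store their content as embedded images that pypdf exposes through `page.images`.

```python
import io
from typing import Callable, List, Optional


def extract_with_fallback(
    num_pages: int,
    extract_page: Callable[[int], Optional[str]],
    ocr_page: Callable[[int], Optional[str]],
    min_chars: int = 20,  # hypothetical threshold; tune on real documents
) -> List[str]:
    """Return per-page text, calling ocr_page only when extract_page finds too little."""
    texts = []
    for i in range(num_pages):
        text = (extract_page(i) or "").strip()
        if len(text) < min_chars:
            # Little or no embedded text: probably a scanned page, so try OCR.
            ocr_text = (ocr_page(i) or "").strip()
            # Keep whichever result recovered more text.
            if len(ocr_text) > len(text):
                text = ocr_text
        texts.append(text)
    return texts


def read_pdf(path: str) -> str:
    """Hypothetical wiring of the helper above to pypdf + Pillow + pytesseract."""
    # Imported lazily so extract_with_fallback stays usable without these packages.
    from pypdf import PdfReader
    from PIL import Image
    import pytesseract

    reader = PdfReader(path)

    def extract_page(i: int) -> Optional[str]:
        return reader.pages[i].extract_text()

    def ocr_page(i: int) -> str:
        parts = []
        # Run OCR on every image embedded in the page.
        for image_file in reader.pages[i].images:
            image = Image.open(io.BytesIO(image_file.data))
            parts.append(pytesseract.image_to_string(image))
        return "\n".join(parts)

    pages = extract_with_fallback(len(reader.pages), extract_page, ocr_page)
    return "\n\n".join(pages)
```

The fast pypdf path runs first on every page, and OCR is only paid for on pages where it comes back nearly empty, so PDFs with a proper text layer should not get slower. PDFs that draw scanned content in other ways (for example as vector graphics) would need the page rendered to an image first, which this sketch does not cover.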