danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.

Home Page: https://docs.danswer.dev/

PDF file upload not correctly parsed

plopezamaya opened this issue

While using v3.0.79, it seems that some PDFs are not parsed correctly by the pypdf reader (from pypdf import PdfReader) in backend/danswer/file_processing/extract_file_text.py.

As a result, the LLM answers that it cannot retrieve any information from the given document. Should an OCR reader or another framework be used for this?

If the PDFs contain scanned pages or are otherwise image-based, you might need to use OCR to extract the text.
pytesseract, along with an image processing library like Pillow, should work for extracting text from images within PDFs.

@arrfandannge Does that mean that what is implemented in danswer today should be replaced by OCR with Pillow?

I suggest keeping the current implementation intact but adding code that uses OCR as a fallback when pypdf returns little or no text. This keeps the solution robust and stays efficient, because extraction with pypdf is generally faster than OCR.

I can try implementing this solution in a separate branch and see how it works.
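
For reference, here is a rough sketch of what that fallback could look like. It is not the code in extract_file_text.py, just an illustration under some assumptions: the function names and the MIN_CHARS_PER_PAGE threshold are made up for this example, and the OCR path assumes pdf2image (which needs Poppler) and pytesseract (which needs the Tesseract binary) are installed.

```python
from typing import IO, Callable, List

# Hypothetical threshold: below this average, assume the PDF has no usable text layer.
MIN_CHARS_PER_PAGE = 20


def pypdf_page_texts(file: IO[bytes]) -> List[str]:
    """Fast path: read each page's text layer with pypdf."""
    from pypdf import PdfReader  # lazy import so the fallback logic can be tested alone

    reader = PdfReader(file)
    return [page.extract_text() or "" for page in reader.pages]


def ocr_page_texts(file: IO[bytes]) -> List[str]:
    """Slow path: render each page to an image and OCR it.

    Assumption: pdf2image and pytesseract are available; neither is
    confirmed to be a danswer dependency.
    """
    from pdf2image import convert_from_bytes
    import pytesseract

    file.seek(0)  # the fast path may already have consumed the stream
    images = convert_from_bytes(file.read())  # one PIL image per page
    return [pytesseract.image_to_string(image) for image in images]


def extract_pdf_text(
    file: IO[bytes],
    primary: Callable[[IO[bytes]], List[str]] = pypdf_page_texts,
    fallback: Callable[[IO[bytes]], List[str]] = ocr_page_texts,
    min_chars_per_page: int = MIN_CHARS_PER_PAGE,
) -> str:
    """Try the primary extractor first; use the fallback only if it finds almost nothing."""
    pages = primary(file)
    total_chars = sum(len(text.strip()) for text in pages)
    if not pages or total_chars < min_chars_per_page * len(pages):
        pages = fallback(file)
    return "\n\n".join(pages)
```

Usage would be something like extract_pdf_text(open("scan.pdf", "rb")). The extractors are passed in as parameters so the decision logic can be tried out with stubs, and so the OCR backend can be swapped without touching the pypdf path. The threshold is a guess and would need tuning on real documents.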