h2oai / h2ogpt

Private chat with local GPT with documents, images, video, and more. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

Home Page: http://h2o.ai


doctr for scanned pdf

InesBenAmor99 opened this issue · comments

I have docTR installed. When I upload a photo, it works very well and extracts text perfectly. However, when I upload a scanned PDF, it keeps processing for a very long time without any response or error. What could be missing?

It also works with a scanned PDF in one case: if, for example, the first page is not scanned (it contains a title or a sentence) and all the following pages are scanned, it extracts text from all the pages perfectly. But when the whole PDF is scanned, it keeps processing without a response.

By default docTR isn't used for PDFs with many pages, so it's possible OCR (the unstructured package) is being used instead. Can you tell from the command line? I would disable unstructured and OCR from the expert panel in the UI and try again. You can also disable them via the CLI.

pymupdf is the default loader; only when the PDF is entirely (page by page) a scanned, image-based PDF does it progressively fall back to other methods.

i.e. it does in order:

  • pymupdf
  • pypdf
  • unstructured pdf
  • OCR based unstructured pdf
  • DocTR
  • As HTML instead of PDF, in case the file extension is wrong.
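The fallback order above can be sketched roughly like this. This is a minimal illustration of the "try each loader until one yields text" idea; the function and loader names are placeholders, not h2ogpt's actual internals:

```python
# Minimal sketch of the progressive PDF-loader fallback described above.
# Loader names are placeholders for illustration, not h2ogpt's real API.

def load_pdf(path, loaders):
    """Try each (name, loader) pair in order; fall through when one fails
    or returns no usable text, mirroring pymupdf -> pypdf -> unstructured
    -> OCR-based unstructured -> DocTR -> HTML."""
    for name, loader in loaders:
        try:
            text = loader(path)
        except Exception:
            continue  # this backend crashed; try the next one
        if text and text.strip():
            return name, text  # first backend yielding real text wins
    return None, ""

def broken_loader(path):
    # Stand-in for a backend that errors out on this file.
    raise RuntimeError("backend failed")

# Hypothetical backends: a fully scanned PDF has no text layer, so the
# plain text extractors return nothing and only the OCR stage succeeds.
loaders = [
    ("pymupdf", lambda p: ""),
    ("pypdf", lambda p: ""),
    ("unstructured", broken_loader),
    ("doctr", lambda p: "text recovered by OCR"),
]
print(load_pdf("scanned.pdf", loaders))  # ('doctr', 'text recovered by OCR')
```

This also shows why a fully scanned PDF is the slow path: every earlier backend must run and fail before OCR/DocTR is even attempted.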

With the CLI, you can disable everything except DocTR and see how it goes, then narrow down which step is taking the time. If there are many pages, DocTR does take some time, but I've seen OCR from unstructured take even longer with worse quality.
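For example, recent h2ogpt versions expose per-loader switches on generate.py. Treat the flag names below as assumptions (they may differ across versions) and confirm them with `python generate.py --help`:

```shell
# Assumed flag names -- verify against `python generate.py --help` for your version.
python generate.py \
  --use_pymupdf=off \
  --use_pypdf=off \
  --use_unstructured_pdf=off \
  --enable_pdf_ocr=off \
  --enable_pdf_doctr=on \
  --try_pdf_as_html=off \
  --verbose
```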

Also if you have a PDF that you can share that shows the issue, I'm happy to look. If you need to keep it semi-private, you can email me at jon.mckinney@h2o.ai.

Hello, thank you for responding. It's not one specific PDF. In general, if the PDF contains a non-scanned title page or similar and all the subsequent pages are scanned, docTR works and extracts text from the entire file. If the whole PDF is scanned, it keeps processing without a response. I'm showing an example in this video: the first document contains 7 pages; the first page contains a non-scanned title, and the next 6 pages are all scanned, yet the tool extracts text from all pages (I can verify this through the document viewer ==> view database text). The second document is "image_based_pdf_sample" from the test folder in h2ogpt, which is similar to my problematic documents. As you can see, it keeps processing without responding. I shortened the video so I could share it here, but it is still processing, as shown in the screenshot.

testing-doctr_6BSPCc9W.mp4

image

Can you do three things?

  1. Add --verbose to the CLI options.
  2. Do `ps -auxwf | grep -A 5 -B 5 generate` and share the output to see what is running.
  3. When it's stuck, run `kill -s SIGUSR1 <pid>`, where `<pid>` is the "generate.py" pid from above or a deeper fork (lower in the list, below generate in the tree), and share the full output that goes to the console. It will include many threads, and that may show where it's stuck.
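As an aside, the SIGUSR1 trick works because the process installs a handler that dumps every thread's stack when the signal arrives. A minimal, standalone way to get the same behavior in any Python program uses the standard-library faulthandler module (POSIX only; this illustrates the mechanism, not necessarily how generate.py implements it):

```python
import faulthandler
import os
import signal
import tempfile

# Register a SIGUSR1 handler that dumps all thread stacks to a log file.
# Once installed, `kill -s SIGUSR1 <pid>` from a shell triggers the dump.
log = open(os.path.join(tempfile.mkdtemp(), "stacks.log"), "w")
faulthandler.register(signal.SIGUSR1, file=log, all_threads=True)

# Simulate the shell's `kill -s SIGUSR1 <pid>` by signalling ourselves.
os.kill(os.getpid(), signal.SIGUSR1)

log.flush()
with open(log.name) as f:
    dump = f.read()
print(dump.splitlines()[0])  # header line of the per-thread traceback dump
```

The dump lists each thread with its current call stack (most recent call first), which is exactly the kind of output that shows where a stuck ingestion process is spending its time.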

Also, can you try gpt.h2o.ai -- do you see similar issues there? This will help identify whether it is an installation or computer issue.