virantha / pypdfocr

Python script to do PDF OCR conversion using Tesseract

[Request] provide different quality image files for ocr and final merging

sekisushai opened this issue · comments

Hello,
I would like to know if it's possible to provide different image files for OCR and for the final merging:

  • High-quality images are provided for OCR.
  • More compressed (and much smaller) versions of the same images are provided for the merging, to avoid producing a very large PDF.

To illustrate: I process scans with Scan Tailor, merge them using pdftk, and then run your script to apply OCR. However, the OCR output depends strongly on the quality of the images in the PDF. If I provide a PDF composed of high-DPI color scans, the OCR is great, but the final PDF is also very large. I'd therefore like an option to provide more compressed images from Scan Tailor for the final merged PDF.
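
To make the request concrete, here is a minimal sketch of the pairing step such an option would probably need: matching each high-quality page image (for OCR) with its compressed counterpart (for merging). It is not part of pypdfocr. The function name `pair_pages` and the file names are hypothetical, and it assumes both sets of images share the same file stem per page.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def pair_pages(ocr_names: List[str], merge_names: List[str]) -> List[Tuple[str, str]]:
    """Pair each high-quality OCR image with the compressed image of the same page.

    Hypothetical helper: pages are matched by file stem, e.g. page_001.tif
    with page_001.jpg. This naming convention is an assumption.
    """
    merge_by_stem: Dict[str, str] = {Path(name).stem: name for name in merge_names}
    pairs: List[Tuple[str, str]] = []
    for name in sorted(ocr_names):
        stem = Path(name).stem
        if stem not in merge_by_stem:
            # Fail early instead of silently dropping a page from the final PDF.
            raise ValueError(f"no compressed image found for page {stem!r}")
        pairs.append((name, merge_by_stem[stem]))
    return pairs


# Example file names (made up for illustration).
pairs = pair_pages(
    ["page_002.tif", "page_001.tif"],
    ["page_001.jpg", "page_002.jpg"],
)
for ocr_image, merge_image in pairs:
    print(f"OCR on {ocr_image}, embed {merge_image} in the final PDF")
```

With such a pairing, the text layer would be computed from the first file of each pair while the second is the one embedded in the output. Both images of a page would need the same physical dimensions so the recognized text lines up with the compressed image.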

I'll look into this, but in general I try to avoid touching the original images as much as possible. I could look at providing an option for this, maybe in 0.9.2.