jalan / pdftotext

Simple PDF text extraction

raw=False argument not working in latest version

surazgyawali opened this issue · comments

I tested with a few files; the layout is preserved in 2.1.5 but not in 2.2.0.

Thank you for the great work, by the way. My workflow depends heavily on your module. Thank you again.

The default layout has changed in version 2.2.0 to match what poppler recommends. You now have three choices:

  • pdftotext.PDF(f): default layout
  • pdftotext.PDF(f, physical=True): physical layout, which looks to be what you want
  • pdftotext.PDF(f, raw=True): raw layout

By the way, there is never a need to pass raw=False or physical=False, since those are the defaults.
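
For anyone switching between the three layouts, here is a minimal sketch of one way to wrap the choice. The helper names `layout_kwargs` and `extract_text` are made up for illustration and are not part of pdftotext; only `pdftotext.PDF` and its `physical` and `raw` keywords come from the library, and `physical` assumes version 2.2.0 or later.

```python
def layout_kwargs(layout="default"):
    """Map a layout name to keyword arguments for pdftotext.PDF.

    Hypothetical helper, not part of pdftotext itself.
    """
    options = {
        "default": {},                   # pdftotext.PDF(f)
        "physical": {"physical": True},  # pdftotext.PDF(f, physical=True)
        "raw": {"raw": True},            # pdftotext.PDF(f, raw=True)
    }
    if layout not in options:
        raise ValueError(f"unknown layout: {layout!r}")
    return options[layout]


def extract_text(path, layout="default"):
    """Return the text of every page of the PDF at `path`, joined by blank lines.

    Hypothetical helper; assumes pdftotext >= 2.2.0 is installed.
    """
    import pdftotext  # imported here so layout_kwargs works without the library

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f, **layout_kwargs(layout))
    # A PDF object can be iterated page by page, each page being a string.
    return "\n\n".join(pdf)
```

With this sketch, `extract_text("some_file.pdf", layout="physical")` would request the physical layout for a placeholder file name. Because the defaults are expressed as an empty dict, `raw=False` and `physical=False` are never passed explicitly.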

https://github.com/jalan/pdftotext/blob/master/CHANGES.md

Oh, got it, my bad. I tried to look up the changes too, but wasn't careful enough to find them (ashamed). Thank you for the reply.