jalan / pdftotext

Simple PDF text extraction

After upgrade to 2.2.0 all strings are treated as separate lines

sprnza opened this issue

Hi there!
I use this snippet to convert a PDF to text:

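        import pdftotext

        def pdf_to_lines(file):
            # Parse the PDF, then split every page into individual lines
            with open(file.name, 'rb') as f:
                pdf = pdftotext.PDF(f)
            text = []
            for p in pdf:
                text += p.splitlines()
            return text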

With 2.1.6, pdftotext preserved the line breaks of the source file, but in 2.2.0 every parsed string is treated as a separate line. Has something changed in this regard?

The default behavior was changed to match what poppler recommends. This is really how it should have worked all along. To get the previous behavior, you can use pdftotext.PDF(f, physical=True). Sorry for the confusion!

https://github.com/jalan/pdftotext/blob/master/CHANGES.md
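For anyone landing here with the same problem, the workaround above could be applied roughly as follows. This is only a minimal sketch: the helper name `extract_lines` and its `path` argument are made up for illustration, and the only library call it relies on is `pdftotext.PDF(f, physical=True)` as quoted in the reply.

```python
def extract_lines(path, physical=True):
    """Return the text lines of the PDF at `path` (hypothetical helper)."""
    # Imported inside the function so the sketch needs pdftotext only when called.
    import pdftotext

    lines = []
    with open(path, "rb") as f:
        # physical=True requests the pre-2.2.0 layout behavior described above.
        pdf = pdftotext.PDF(f, physical=physical)
        for page in pdf:
            # Each page is a single string; split it into separate lines.
            lines.extend(page.splitlines())
    return lines
```

Calling `extract_lines("some.pdf", physical=False)` should give the new 2.2.0 default instead, which makes it easy to compare the two outputs on the same file.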

Thanks! Sorry for bothering you and for not reading CHANGES.md.