After upgrade to 2.2.0 all strings are treated as separate lines

sprnza opened this issue 3 years ago · comments

Hi there!
I use this snippet to get pdf converted to text:

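        import pdftotext

        def pdf_to_lines(file):  # illustrative wrapper; `file` exposes a .name path
            # Open the PDF in binary mode and parse it.
            with open(file.name, 'rb') as f:
                pdf = pdftotext.PDF(f)
            text = []
            for p in pdf:
                # Each page is a str; split it into individual lines.
                text += p.splitlines()
            return text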

With 2.1.6, pdftotext preserved the line breaks of the source file, but in 2.2.0 every parsed string is treated as a separate line. Has something changed in this regard?

Jason Alan Palmer · Answer 1 · Sun Sep 12 2021 04:12:39 GMT+0800 (China Standard Time)

The default behavior was changed to match what poppler recommends. This is really how it should have worked all along. To get the previous behavior, you can use pdftotext.PDF(f, physical=True). Sorry for the confusion!

https://github.com/jalan/pdftotext/blob/master/CHANGES.md
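For anyone landing here with the same snippet, a minimal sketch of the suggested fix might look like the following. It assumes pdftotext 2.2.0 or later and passes physical=True as described above. The function names pages_to_lines and pdf_to_lines are made up for illustration and are not part of the library.

```python
def pages_to_lines(pages):
    """Flatten an iterable of page strings into one list of lines."""
    lines = []
    for page in pages:
        # str.splitlines() drops the line terminators themselves.
        lines.extend(page.splitlines())
    return lines


def pdf_to_lines(path, physical=True):
    """Return all lines of the PDF at `path` (hypothetical helper).

    physical=True asks pdftotext for the physical layout, which the
    answer above describes as the pre-2.2.0 behavior.
    """
    # Imported here so pages_to_lines stays usable without pdftotext installed.
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f, physical=physical)
    return pages_to_lines(pdf)
```

Calling pdf_to_lines(path, physical=False) would keep the new 2.2.0 default instead, which makes it easy to compare the two layouts on the same file.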

sprnza · Answer 2 · Mon Sep 13 2021 14:18:29 GMT+0800 (China Standard Time)

Thanks! Sorry for bothering you and for not reading CHANGES.md.