After upgrade to 2.2.0 all strings are treated as separate lines
sprnza commented
Hi there!
I use this snippet to get pdf converted to text:
with open(file.name, 'rb') as f:
    pdf = pdftotext.PDF(f)
    text = []
    for p in pdf:
        text += p.splitlines()
    return text
With 2.1.6, pdftotext preserved the line breaks of the source file, but in 2.2.0 every parsed string is treated as a separate line. Has something changed in this regard?
Jason Alan Palmer commented
The default behavior was changed to match what poppler recommends. This is really how it should have worked all along. To get the previous behavior, you can use pdftotext.PDF(f, physical=True). Sorry for the confusion!
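For reference, here is a minimal sketch of the original snippet with that flag applied. The helper names `pages_to_lines` and `pdf_to_lines` and the `path` parameter are made up for illustration and are not part of pdftotext; only `pdftotext.PDF(f, physical=True)` comes from the suggestion above.

```python
def pages_to_lines(pages):
    """Flatten an iterable of page strings into a single list of lines."""
    lines = []
    for page in pages:
        lines.extend(page.splitlines())
    return lines


def pdf_to_lines(path, physical=True):
    """Read the PDF at `path` and return its text as a list of lines.

    physical=True requests the pre-2.2.0 layout described above.
    (Hypothetical helper, not part of the pdftotext API.)
    """
    # Imported here so pages_to_lines can be used without pdftotext installed.
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f, physical=physical)
        return pages_to_lines(pdf)
```

Calling `pdf_to_lines(file.name)` should then give roughly what the 2.1.6 snippet returned, while `pdf_to_lines(file.name, physical=False)` keeps the new default.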
sprnza commented
Thanks! Sorry for bothering and not reading CHANGES.md