Redirect output not possible?

Question

ktnx opened this issue 6 years ago

With a previous version of pdfannots, I never found a way to redirect the output to a file (> or tee etc.) because whenever I added another argument to the call "python pdfannots x.pdf", I quickly (after only a few pages) got an error like this:

Traceback (most recent call last):
  File "pdfannots.py", line 452, in <module>
    sys.exit(main())
  File "pdfannots.py", line 448, in main
    prettyprint(annots, args.output, args.wrap, args.sections)
  File "pdfannots.py", line 323, in prettyprint
    printitem(a, fmttext(a))
  File "pdfannots.py", line 304, in printitem
    print(msg + "\n", file=outfile)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 15: ordinal not in range(128)

Never got any problems with "python pdfannots x.pdf" alone, worked perfectly every time.
With the new version and the "-o" argument, I still have the same issue. Am I doing anything wrong (I'm not a Python programmer)?

(Also, sometimes I cannot extract source code from PDF files; it ends up as gibberish. Presumably some encoding problem.)

Andrew Baumann · Answer 1 · Mon Feb 25 2019 01:09:47 GMT+0800 (China Standard Time)

Interesting. You have highlighted some text with an umlaut: 'ü'. When you're outputting to a file, Python is trying to convert that to ascii, and failing.
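The failure described above can be reproduced in isolation. The following is a minimal sketch, not code from pdfannots: the helper name `try_encode` and the sample string are made up for illustration. It assumes the situation in the traceback, where Python 2 has no encoding for a redirected stdout and falls back to the ASCII codec.

```python
# Hypothetical helper, not part of pdfannots: try to encode text with a
# given codec and report the reason if the codec cannot represent it.
def try_encode(text, codec):
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return "failed: %s" % exc.reason

sample = u"f\xfcr"  # "für", contains the umlaut named in the traceback

ascii_result = try_encode(sample, "ascii")  # ASCII has no code for U+00FC
utf8_result = try_encode(sample, "utf-8")   # UTF-8 encodes it as two bytes

print(ascii_result)
print(utf8_result)
```

On a terminal, Python can usually pick up the terminal's encoding, which would explain why the same command works without redirection.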

What is your locale set to? What is your host environment like (host OS, python interpreter version, any special environment variables, etc.)?

ktnx · Answer 2 · Mon Feb 25 2019 02:02:43 GMT+0800 (China Standard Time)

As I get this error only when I try "python pdfannots.py x.pdf > output.txt", I do not think it's an encoding problem. Everything works fine when I run "python pdfannots.py x.pdf".
My locale is German, I'm on Windows 10 and Python 2.7.12 (running in WSL). Also, I just got a new laptop and I have that problem on both machines (with a freshly installed WSL+python2+pdfminer).

I just tried an English-only PDF file and got the same problem:
Warning: failed to retrieve outlines: <type 'exceptions.AttributeError'>
Traceback (most recent call last):
  File "pdfannots.py", line 351, in <module>
    main()
  File "pdfannots.py", line 348, in main
    printannots(fh)
  File "pdfannots.py", line 334, in printannots
    prettyprint(allannots, outlines, mediaboxes)
  File "pdfannots.py", line 224, in prettyprint
    printitem(fmtpos(a), fmttext(a))
  File "pdfannots.py", line 215, in printitem
    print(tw.fill(msg) + "\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 13: ordinal not in range(128)

ktnx · Answer 3 · Mon Feb 25 2019 02:18:25 GMT+0800 (China Standard Time)

Addendum: Just ran your new version of pdfannots.py in Python3 instead of Py2, and this combination worked, even with "> output.txt". I think I will stick to this. Thanks a lot anyway!

Andrew Baumann · Answer 4 · Mon Feb 25 2019 05:43:50 GMT+0800 (China Standard Time)

I wasn't aware this code even worked on Python2 any more. It has very different behavior for string output. I'll update the program/README to make Python 3 explicitly required.
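For readers who need output that does not depend on the interpreter's default encoding, one option is to open the output file with an explicit encoding. This is a rough sketch under assumptions, not how pdfannots itself writes output: the function `write_annotations`, the sample lines, and the temporary path are all hypothetical. `io.open` accepts an `encoding` argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical function, not part of pdfannots: write text lines to a
# file as UTF-8 regardless of the locale or of how stdout is attached.
def write_annotations(lines, path):
    with io.open(path, "w", encoding="utf-8") as outfile:
        for line in lines:
            outfile.write(line + u"\n")

# Sample data containing the two characters from the tracebacks.
sample_lines = [u"f\xfcr", u"\u201cquoted\u201d"]

path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_annotations(sample_lines, path)

# Read the file back with the same encoding to confirm the round trip.
with io.open(path, encoding="utf-8") as infile:
    content = infile.read()
```

When redirecting with ">" under Python 2, setting the environment variable PYTHONIOENCODING to utf-8 before running the script is another commonly used workaround, though running under Python 3 as described in this thread avoids the issue.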

ktnx · Answer 5 · Mon Feb 25 2019 06:03:42 GMT+0800 (China Standard Time)

Oh, good to know :)
Thanks!

Btw, do you see any chance that I could extract the source code sections from a programming book's PDF? For most PDF files it used to work fine, but in the one I am reading right now only normal text is extracted correctly; source code ends up completely garbled. Copy and paste from the PDF works, however.

Andrew Baumann · Answer 6 · Mon Feb 25 2019 06:10:44 GMT+0800 (China Standard Time)

The string extraction is not very robust. Feel free to create a new issue with the PDF in question, and I may be able to improve things, but no guarantees.