0xabu / pdfannots

Extracts and formats text annotations from a PDF file

Redirect output not possible?

ktnx opened this issue

commented

With a previous version of pdfannots, I never found a way to redirect the output to a file (with > or tee, for example), because whenever I added anything to the call "python pdfannots.py x.pdf", I got an error like this after only a few pages:

Traceback (most recent call last):
  File "pdfannots.py", line 452, in <module>
    sys.exit(main())
  File "pdfannots.py", line 448, in main
    prettyprint(annots, args.output, args.wrap, args.sections)
  File "pdfannots.py", line 323, in prettyprint
    printitem(a, fmttext(a))
  File "pdfannots.py", line 304, in printitem
    print(msg + "\n", file=outfile)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 15: ordinal not in range(128)

I never had any problems with "python pdfannots.py x.pdf" alone; it worked perfectly every time.
With the new version and the "-o" argument, I still have the same issue. Am I doing something wrong? (I'm not a Python programmer.)

(Also, sometimes I cannot extract source code from PDF files; it ends up as gibberish. Presumably that is some encoding problem too.)

Interesting. You have highlighted some text with an umlaut: 'ü'. When the output goes to a file rather than the terminal, Python tries to encode that character as ASCII, and fails.

What is your locale set to? What is your host environment like (host OS, python interpreter version, any special environment variables, etc.)?
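To gather the details asked for above, a short script along these lines could be run both directly and with "> output.txt". This is only a sketch: `can_encode` and `sample` are hypothetical names and not part of pdfannots. On Python 2, `sys.stdout.encoding` is `None` when output is piped or redirected, and printing unicode text then falls back to ASCII, which cannot represent u'\xfc'.

```python
from __future__ import print_function

import locale
import sys

# Sample text containing u'\xfc', the character named in the traceback.
sample = u"f\xfcr"


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Report the environment details that determine how output is encoded.
print("python version:    ", sys.version.split()[0])
print("stdout encoding:   ", getattr(sys.stdout, "encoding", None))
print("preferred encoding:", locale.getpreferredencoding())
print("ASCII can encode:  ", can_encode(sample, "ascii"))
print("UTF-8 can encode:  ", can_encode(sample, "utf-8"))
```

Comparing the "stdout encoding" line between the two runs should show whether redirection changes the encoding Python uses.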

commented

As I get this error only when I try "python pdfannots.py x.pdf > output.txt", I do not think it's an encoding problem. Everything works fine when I run "python pdfannots.py x.pdf".
My locale is German, I'm on Windows 10 and Python 2.7.12 (running in WSL). Also, I just got a new laptop and I have that problem on both machines (with a freshly installed WSL+python2+pdfminer).

I just tried an English-only PDF file and got the same problem:
Warning: failed to retrieve outlines: <type 'exceptions.AttributeError'>
Traceback (most recent call last):
  File "pdfannots.py", line 351, in <module>
    main()
  File "pdfannots.py", line 348, in main
    printannots(fh)
  File "pdfannots.py", line 334, in printannots
    prettyprint(allannots, outlines, mediaboxes)
  File "pdfannots.py", line 224, in prettyprint
    printitem(fmtpos(a), fmttext(a))
  File "pdfannots.py", line 215, in printitem
    print(tw.fill(msg) + "\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 13: ordinal not in range(128)

commented

Addendum: I just ran your new version of pdfannots.py with Python 3 instead of Python 2, and this combination worked, even with "> output.txt". I think I will stick with this. Thanks a lot anyway!
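A likely reason Python 3 behaves differently is that it encodes redirected output with the locale's encoding instead of defaulting to ASCII. One way to avoid depending on the locale at all is to open the output file with an explicit encoding. The following is a minimal sketch of that idea, not pdfannots' actual code: `write_text` and `read_text` are hypothetical helpers, and the sample string simply contains the two characters from the tracebacks above.

```python
import io
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Write unicode `text` to `path` using an explicit encoding."""
    # io.open accepts an encoding argument on both Python 2 and Python 3.
    with io.open(path, "w", encoding=encoding) as outfile:
        outfile.write(text)


def read_text(path, encoding="utf-8"):
    """Read `path` back as unicode text using the same explicit encoding."""
    with io.open(path, "r", encoding=encoding) as infile:
        return infile.read()


# u'\xfc' and u'\u201d' are the characters the 'ascii' codec rejected.
sample = u"f\xfcr \u201cquoted\u201d\n"
path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_text(path, sample)
```

Because the encoding is fixed when the file is opened, the result no longer depends on whether the script runs in a terminal or with its output redirected.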

I wasn't aware this code even worked on Python 2 any more. Python 2 has very different behavior for string output. I'll update the program/README to make Python 3 explicitly required.
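A check of this kind could look roughly like the sketch below. It is an illustration only, not the change that was actually made to pdfannots; `check_python3` is a hypothetical helper name.

```python
import sys


def check_python3(version_info=None):
    """Return an error message if the interpreter is older than Python 3, else None."""
    if version_info is None:
        version_info = sys.version_info
    if version_info[0] < 3:
        return "This script requires Python 3 (found %d.%d)." % (
            version_info[0],
            version_info[1],
        )
    return None


if __name__ == "__main__":
    error = check_python3()
    if error:
        # Passing a string to sys.exit prints it to stderr and exits with status 1.
        sys.exit(error)
```

Failing early with a clear message would replace the confusing UnicodeEncodeError with a direct hint about the interpreter version.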

commented

Oh, good to know :)
Thanks!

Btw, do you see any chance that I could extract the source code sections from a programming book's PDF? For most PDF files it used to work fine, but in the one I am reading right now, only normal text is extracted correctly; source code comes out completely garbled. Copy and paste from the PDF works, however.

The string extraction is not very robust. Feel free to create a new issue with the PDF in question, and I may be able to improve things, but no guarantees.