#17 in Arch Linux
dvtate opened this issue · comments
Same issue with same file, Arch Linux, poppler 23.12.0-1, and pdftotext 2.2.2-4
I'm guessing the Python bindings are broken somehow.
#17 - link
Using poppler directly via the pdftotext command works fine, so I'll just do that for now.
Can you attach the file here that you claim doesn't work?
I have nothing to do with arch linux, by the way.
it's the same one from #17 -- https://arxiv.org/pdf/1004.5293.pdf
No worries, I'd reach out to the package maintainer but I'm not sure how to get their email address
It actually doesn't work with any PDF file for me.
I don't use arch, so maybe I got something wrong here, but it seems to work:
$ docker run -it archlinux:base
[root@0366535679bd /]# pacman -Sy python-pdftotext
[output cut]
[root@0366535679bd /]# curl --silent --output test.pdf https://arxiv.org/pdf/1004.5293.pdf
[root@0366535679bd /]# python
Python 3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801] on linux
>>> import pdftotext
>>> f = open("test.pdf", "rb")
>>> pdf = pdftotext.PDF(f)
>>> len(pdf)
34
>>> pdf[0][:100]
'EPJ manuscript No.\n(will be inserted by the editor)\n\narXiv:1004.5293v2 [physics.ins-det] 7 Jun 2010\n'
>>>
Interesting, it seems to work fine when the file is opened in rb mode but not in wb+ mode. Regardless, I think I can make this work, thanks!
wb+ truncates the file, so of course that doesn't work.
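For anyone who lands here later, a minimal sketch of the truncation behaviour using only the standard library. The file name, the placeholder bytes, and the helper `size_after_open` are made up for illustration; a real PDF on disk would be emptied the same way.

```python
import os
import tempfile


def size_after_open(path, mode):
    """Open path with the given mode, close it again, and return its size in bytes."""
    with open(path, mode):
        pass
    return os.path.getsize(path)


# Placeholder bytes standing in for a PDF (hypothetical content, not a valid document).
contents = b"%PDF-1.4 placeholder bytes"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.pdf")
    with open(path, "wb") as f:
        f.write(contents)

    # "rb" only reads, so the data on disk is untouched.
    size_rb = size_after_open(path, "rb")

    # Any "w" mode, including "wb+", truncates the file as soon as it is opened.
    size_wb_plus = size_after_open(path, "wb+")

print(size_rb, size_wb_plus)
```

After the "wb+" open there is nothing left to parse, whichever library reads the file next.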
I started with file.write(contents), which works with pypdf.
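If the handle is opened with wb+ and filled via file.write(contents), one likely explanation (an assumption, not something confirmed in this thread) is that the file position is left at the end of the data, so a plain read() returns nothing unless the handle is rewound first. A rough sketch with placeholder bytes and a made-up file name:

```python
import os
import tempfile

# Placeholder bytes standing in for a downloaded PDF (hypothetical content).
contents = b"%PDF-1.4 placeholder bytes"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "download.pdf")
    with open(path, "wb+") as f:
        f.write(contents)

        # The position is now at the end of the file, so read() returns b"".
        at_end = f.read()

        # Rewind to the start before handing the file object to a reader.
        f.seek(0)
        rewound = f.read()

print(at_end, rewound)
```

Calling f.seek(0) right after the write, before passing f to pdftotext.PDF, may be all that is needed here. Wrapping the bytes in io.BytesIO(contents) is another option, assuming the bindings accept any binary file-like object.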