#17 in Arch Linux

Question

dvtate opened this issue 10 months ago

Dustin Van Tate Testa commented 10 months ago

Same issue with the same file on Arch Linux, poppler 23.12.0-1, and pdftotext 2.2.2-4.

I'm guessing the Python bindings are broken somehow.

Dustin Van Tate Testa · Answer 1 · Wed Jan 03 2024 05:18:08 GMT+0800 (China Standard Time)

#17 - link

Dustin Van Tate Testa · Answer 2 · Wed Jan 03 2024 05:21:17 GMT+0800 (China Standard Time)

Directly using poppler via pdftotext works fine, so I'll just do that for now.
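
For anyone wanting the same workaround from Python, here is a minimal sketch that shells out to poppler's pdftotext command-line tool instead of using the bindings. The helper names build_command and extract_text are made up for illustration, and it assumes the pdftotext binary is on PATH. Passing "-" as the output file asks the tool to write the text to stdout.

```python
import subprocess


def build_command(pdf_path):
    # Hypothetical helper: "-" as the text-file argument makes the
    # pdftotext CLI print the extracted text to stdout.
    return ["pdftotext", pdf_path, "-"]


def extract_text(pdf_path):
    # Hypothetical helper: run the CLI and return its output as a string.
    # check=True raises CalledProcessError if pdftotext exits non-zero.
    result = subprocess.run(build_command(pdf_path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("test.pdf") would return the whole document as one string, rather than the per-page sequence the bindings give.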

Jason Alan Palmer · Answer 3 · Wed Jan 03 2024 05:33:56 GMT+0800 (China Standard Time)

Can you attach the file here that you claim doesn't work?

I have nothing to do with Arch Linux, by the way.

Dustin Van Tate Testa · Answer 4 · Wed Jan 03 2024 11:39:02 GMT+0800 (China Standard Time)

It's the same one from #17: https://arxiv.org/pdf/1004.5293.pdf

No worries. I'd reach out to the package maintainer, but I'm not sure how to get their email address.

Dustin Van Tate Testa · Answer 5 · Wed Jan 03 2024 11:39:42 GMT+0800 (China Standard Time)

It actually doesn't work with any PDF file for me.

Jason Alan Palmer · Answer 6 · Thu Jan 04 2024 00:38:29 GMT+0800 (China Standard Time)

I don't use Arch, so maybe I got something wrong here, but it seems to work:

$ docker run -it archlinux:base
[root@0366535679bd /]# pacman -Sy python-pdftotext
[output cut]
[root@0366535679bd /]# curl --silent --output test.pdf https://arxiv.org/pdf/1004.5293.pdf
[root@0366535679bd /]# python
Python 3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801] on linux
>>> import pdftotext
>>> f = open("test.pdf", "rb")
>>> pdf = pdftotext.PDF(f)
>>> len(pdf)
34
>>> pdf[0][:100]
'EPJ manuscript No.\n(will be inserted by the editor)\n\narXiv:1004.5293v2 [physics.ins-det] 7 Jun 2010\n'
>>>

Dustin Van Tate Testa · Answer 7 · Thu Jan 04 2024 02:10:48 GMT+0800 (China Standard Time)

Interesting, it seems to work fine with rb but not wb+. Regardless, I think I can make this work, thanks!

Jason Alan Palmer · Answer 8 · Thu Jan 04 2024 02:37:03 GMT+0800 (China Standard Time)

wb+ truncates the file, so of course that doesn't work
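
A small sketch of that behavior, using only the standard library. The file content is a placeholder, not a real PDF, and the variable names are made up for illustration.

```python
import os
import tempfile

# Create a small stand-in file (placeholder bytes, not a real PDF).
path = os.path.join(tempfile.mkdtemp(), "test.pdf")
with open(path, "wb") as f:
    f.write(b"%PDF-1.4 placeholder bytes")

size_before = os.path.getsize(path)

# "rb" opens for reading and leaves the contents intact.
with open(path, "rb") as f:
    data_rb = f.read()

# "wb+" allows reading too, but it truncates the file to zero
# length as soon as it is opened, so there is nothing left to read.
with open(path, "wb+") as f:
    data_wb_plus = f.read()

size_after = os.path.getsize(path)
```

After this runs, data_wb_plus is empty and the file on disk is zero bytes long, so any parser handed that handle has no PDF to work with.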

Dustin Van Tate Testa · Answer 9 · Thu Jan 04 2024 05:01:44 GMT+0800 (China Standard Time)

I started with file.write(contents), which works with pypdf.
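
For reference, a rough sketch of the write-then-read pattern, assuming the handle is written to and then read from without being reopened. The contents value is a placeholder for the downloaded bytes. The thread does not establish whether pdftotext or pypdf rewinds the handle itself, so the explicit seek(0) here is the conservative choice.

```python
import tempfile

# Placeholder for the downloaded PDF bytes (assumption, not a real PDF).
contents = b"%PDF-1.4 placeholder bytes"

# "w+b" gives a fresh read/write binary file.
with tempfile.TemporaryFile("w+b") as f:
    f.write(contents)

    # The position is now at the end of the file, so a plain read()
    # returns nothing.
    read_without_rewind = f.read()

    # Rewind before handing the handle to anything that calls read().
    f.seek(0)
    read_after_rewind = f.read()
```

With the rewind in place, the same handle could presumably be passed to pdftotext.PDF the way the "rb" handle is in Answer 6.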