jalan / pdftotext

Simple PDF text extraction

#17 in arch linux

dvtate opened this issue · comments

Same issue with same file, Arch Linux, poppler 23.12.0-1, and pdftotext 2.2.2-4

I'm guessing the python bindings are broken somehow.

Using poppler directly via the pdftotext command-line tool works fine, so I'll just do that for now.
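
For anyone taking the same workaround, here is a minimal sketch of calling Poppler's `pdftotext` executable from Python. The helper names `build_pdftotext_command` and `extract_text_cli` are made up for illustration and are not part of any library; the sketch assumes `pdftotext` is on `PATH` and relies on `-` as the output argument to send text to stdout.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the argument list for Poppler's pdftotext CLI (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to keep the physical page layout.
        cmd.append("-layout")
    # "-" as the output file sends the extracted text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text_cli(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a str (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext executable not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Assumes the default UTF-8 output encoding of pdftotext.
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would look like `text = extract_text_cli("test.pdf")`. This bypasses the Python bindings entirely, at the cost of spawning a process per file.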

Can you attach the file here that you claim doesn't work?

I have nothing to do with arch linux, by the way.

It's the same one from #17: https://arxiv.org/pdf/1004.5293.pdf

No worries, I'd reach out to the package maintainer but I'm not sure how to get their email address

It actually doesn't work with any PDF file for me.

I don't use arch, so maybe I got something wrong here, but it seems to work:

$ docker run -it archlinux:base
[root@0366535679bd /]# pacman -Sy python-pdftotext
[output cut]
[root@0366535679bd /]# curl --silent --output test.pdf https://arxiv.org/pdf/1004.5293.pdf
[root@0366535679bd /]# python
Python 3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801] on linux
>>> import pdftotext
>>> f = open("test.pdf", "rb")
>>> pdf = pdftotext.PDF(f)
>>> len(pdf)
34
>>> pdf[0][:100]
'EPJ manuscript No.\n(will be inserted by the editor)\n\narXiv:1004.5293v2 [physics.ins-det] 7 Jun 2010\n'
>>> 

Interesting, it seems to work fine with rb but not wb+. Regardless, I think I can make this work, thanks!

Opening a file with wb+ truncates it to zero bytes, so of course that doesn't work.
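
The truncation can be checked without any PDF library. The sketch below writes some placeholder bytes to a temporary file, reopens it in a given mode, and reports the size afterwards. `size_after_open` and the payload are illustrative only.

```python
import os
import tempfile


def size_after_open(mode, payload=b"%PDF-1.4 placeholder bytes"):
    """Write payload to a temp file, reopen it with `mode`, return the file size."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        # Merely opening the file is enough: any "w" mode truncates on open.
        with open(path, mode):
            pass
        return os.path.getsize(path)
    finally:
        os.remove(path)


print(size_after_open("rb"))   # size is unchanged: reading does not modify the file
print(size_after_open("wb+"))  # 0: the existing contents are gone
```

So an existing PDF opened with `wb+` is already empty by the time anything tries to parse it; `rb` (or `rb+` if writing is also needed) leaves the contents in place.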

I started with file.write(contents), which works with pypdf.
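
A possible explanation for the difference, offered as an assumption and not something confirmed in this thread: after `file.write(contents)` the file position sits at the end, so a consumer that simply calls `read()` on the object gets no bytes unless the file is rewound first. The sketch below shows that behaviour with plain file objects; `read_after_write` and `as_file_object` are hypothetical helper names.

```python
import io
import tempfile


def read_after_write(contents, rewind):
    """Write bytes to a wb+ temp file, then read from the current position."""
    with tempfile.TemporaryFile("wb+") as f:
        f.write(contents)
        if rewind:
            f.seek(0)  # move back to the start before handing the file to a reader
        return f.read()


def as_file_object(contents):
    """Wrap in-memory bytes in a binary file-like object, positioned at the start."""
    return io.BytesIO(contents)


print(read_after_write(b"abc", rewind=False))  # b'' (position is at end of file)
print(read_after_write(b"abc", rewind=True))   # b'abc'
```

Under that assumption, either calling `f.seek(0)` after the write or passing `io.BytesIO(contents)` to `pdftotext.PDF` would avoid the empty read when the PDF bytes are already in memory.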