0xabu / pdfannots

Extracts and formats text annotations from a PDF file

Feature: Switch to turn off page label support

lvsass opened this issue

Hi!

For my specific use case it would be great to have an option to have pdfminer ignore page labels.

At the moment I am using a script that, in the resulting markdown file, adds links to the specific page in the PDF, like so:

[Page 2](<file.pdf#page=2>)

Obviously, the page labels often don't correspond to the actual page number in the file, which would make this type of switch useful.

I agree that ignoring labels makes sense (they might be nonsense), but I'm pretty nervous about your script. Trying to parse and modify the markdown sounds pretty fragile. Wouldn't you be better off implementing what you need with the JSON output, or maybe a custom formatter?
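
For illustration, here is a rough sketch of what the JSON route could look like. It is only a sketch under assumptions: it assumes the JSON output is a list of objects with a 1-based physical page number under `page`, an optional label under `page_label`, and the annotated text under `text` or `contents`. Check those key names against the real output before relying on them. The helper name `page_links` is made up for this example.

```python
import json


def page_links(annotations, pdf_name):
    """Build one markdown bullet per annotation, linking to its physical page.

    Assumed (not confirmed) keys: "page", "page_label", "text", "contents".
    """
    lines = []
    for annot in annotations:
        page = annot["page"]             # assumed: 1-based page number in the file
        label = annot.get("page_label")  # assumed: may be absent or None
        text = (annot.get("text") or annot.get("contents") or "").strip()
        if label and str(label) != str(page):
            shown = f"Page {page} (label {label})"
        else:
            shown = f"Page {page}"
        # Same link shape as in the issue: [Page 2](<file.pdf#page=2>)
        lines.append(f"* [{shown}](<{pdf_name}#page={page}>): {text}")
    return lines


# Hand-written sample data standing in for the real JSON output.
sample = json.loads(
    '[{"page": 2, "page_label": "ii", "text": "an example highlight"}]'
)
print("\n".join(page_links(sample, "file.pdf")))
```

Because the link is built from the physical page number rather than the label, it does not matter what the labels say, and there is no markdown to parse after the fact.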

> but I'm pretty nervous about your script. Trying to parse and modify the markdown sounds pretty fragile.

You're not wrong. It is fragile, and it is just a way to automate with text substitution what I would otherwise do later by hand. So far it has worked well.

> Wouldn't you be better off implementing what you need with the json output, or maybe a custom formatter?

To be honest, I had never looked at the JSON output before – it looks promising but in order to turn this into a script I would first have to learn JSON… And I'm not even sure I know what you mean by custom formatter.

My non-existent coding skills aside, I feel like some kind of implementation of page numbers as they appear in the file would be sensible – if not a switch to ignore them, maybe a switch to show both page labels and "regular" page numbers.

Yes, I agree with the suggestion of a switch to ignore page labels. I'll get around to it ... eventually :)

Sounds good!