pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Home Page: https://pymupdf.readthedocs.io

get_toc(simple=False) returns 'to' point coordinates that are not based on a top-left origin

charosen opened this issue · comments

Description of the bug

I have a PDF with the following outline (titles) and content:

1.1 Hello World

1.1.1. first step to hello world

content

I want to extract all the outline entries (titles) and their coordinates on the page.

When I call get_toc(simple=False), fitz returns a TOC list:

[[1,
  '1.1 Hello world',
  1,
  {'kind': 4,
   'xref': 41631,
   'page': 0,
   'to': Point(0.0, 761.8583),
   'zoom': 0.0,
   'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25969',
   'collapse': True,
   'color': (0.0, 0.0, 0.0)}],
 [2,
  '1.1.1 first step to hello world',
  1,
  {'kind': 4,
   'xref': 41632,
   'page': 0,
   'to': Point(0.0, 731.8583),
   'zoom': 0.0,
   'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25972',
   'collapse': True,
   'color': (0.0, 0.0, 0.0)}],
...
]

The returned 'to' points are based on a bottom-left origin, not a top-left one: '1.1 Hello world' is above '1.1.1 first step to hello world' on the page, yet Point(0.0, 761.8583) has a greater y than Point(0.0, 731.8583).

These look like PDF coordinates, not (Py)MuPDF coordinates.

How can I convert these TOC 'to' points to top-left coordinates?

How to reproduce the bug

import fitz

document = fitz.open('mypdf.pdf')

toc = document.get_toc(simple=False)

toc results:

[[1,
  '1.1 Hello world',
  1,
  {'kind': 4,
   'xref': 41631,
   'page': 0,
   **'to': Point(0.0, 761.8583),**
   'zoom': 0.0,
   'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25969',
   'collapse': True,
   'color': (0.0, 0.0, 0.0)}],
 [2,
  '1.1.1 first step to hello world',
  1,
  {'kind': 4,
   'xref': 41632,
   'page': 0,
   **'to': Point(0.0, 731.8583),**
   'zoom': 0.0,
   'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25972',
   'collapse': True,
   'color': (0.0, 0.0, 0.0)}],
...
]

PyMuPDF version

1.24.1

Operating system

Linux

Python version

3.9

You did not provide the reproducing file.

Sorry, I could not upload the PDF file for some reason.

However, it is pretty clear that the 'to' point in the TOC is based on a bottom-left origin, not a top-left origin.

I simply want to convert the 'to' points to top-left coordinates.
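For reference, the conversion being asked about can be sketched as a plain y-flip. This is only a minimal sketch: `pdf_to_top_left` is a hypothetical helper, not a PyMuPDF function, and the page height of 792.0 is an assumption (US Letter), since the real page size is not given in this thread. It also assumes an unrotated page whose box starts at (0, 0).

```python
# Hypothetical helper (not part of PyMuPDF): turn a bottom-left-origin
# PDF point into a top-left-origin point by mirroring y on the page height.
def pdf_to_top_left(x, y, page_height):
    return (x, page_height - y)


# 'to' points taken from the TOC output above.
# 792.0 is an ASSUMED page height; the real value is unknown here.
PAGE_HEIGHT = 792.0
h1 = pdf_to_top_left(0.0, 761.8583, PAGE_HEIGHT)  # '1.1 Hello world'
h2 = pdf_to_top_left(0.0, 731.8583, PAGE_HEIGHT)  # '1.1.1 first step to hello world'

# After the flip, the heading that sits higher on the page has the smaller y.
print(h1[1] < h2[1])  # True
```

With a real document the height would presumably come from the page itself, for example `document[page_number].rect.height`. PyMuPDF also exposes `Page.transformation_matrix`, which, as far as I know, maps PDF coordinates to MuPDF coordinates, so `point * page.transformation_matrix` may be the more robust route. Neither was checked against the file in question.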

It is not at all clear:
What are we even looking at? Where do the "**" come from?
The TOC entries seem to point to named destinations - are there errors in the PDF? Or in our code?
Did the PDF creator want to point to the bottom left point 🤷‍♂️?
Have you tried to look at the PDF's names dictionary?

Again: without the file in question we are already wasting time.

Maybe you simply had a question and just wanted to know how to do coordinate transformation?
In that case you shouldn't have submitted an error report but a post in Discussions.

Sorry for the "**" signs; I only wanted bold text, and I have already deleted them.

My question is:

get_toc(simple=False) returns Point(0.0, 761.8583) for '1.1 Hello World' and Point(0.0, 731.8583) for '1.1.1. first step to hello world'.

'1.1 Hello World' is above '1.1.1. first step to hello world' on the page, yet the y of Point(0.0, 761.8583) is greater than the y of Point(0.0, 731.8583), which does not match PyMuPDF's top-left coordinate system.

OK, to make some progress, I am transferring this thread to Discussions, and we can continue there.