get_toc(simple=False) returns 'to' point coordinates that are not based on a top-left origin
charosen opened this issue · comments
Description of the bug
I have a PDF with outlines (titles) and content like the following:
1.1 Hello World
1.1.1. first step to hello world
content
I want to extract all the outline entries (titles) and their coordinates on the page.
When I call get_toc(simple=False), fitz returns a TOC list:
[[1,
'1.1 Hello world',
1,
{'kind': 4,
'xref': 41631,
'page': 0,
'to': Point(0.0, 761.8583),
'zoom': 0.0,
'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25969',
'collapse': True,
'color': (0.0, 0.0, 0.0)}],
[2,
'1.1.1 first step to hello world',
1,
{'kind': 4,
'xref': 41632,
'page': 0,
'to': Point(0.0, 731.8583),
'zoom': 0.0,
'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25972',
'collapse': True,
'color': (0.0, 0.0, 0.0)}],
...
]
The returned 'to' points are based on a bottom-left origin, not a top-left origin: 1.1 Hello world is above 1.1.1 first step to hello world on the page, but the y value of Point(0.0, 761.8583) is greater than that of Point(0.0, 731.8583).
These look like PDF coordinates, not (Py)MuPDF coordinates.
How can I convert these TOC 'to' points to top-left coordinates?
How to reproduce the bug
import fitz
document = fitz.open('mypdf.pdf')
toc = document.get_toc(simple=False)
toc results:
[[1,
'1.1 Hello world',
1,
{'kind': 4,
'xref': 41631,
'page': 0,
**'to': Point(0.0, 761.8583),**
'zoom': 0.0,
'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25969',
'collapse': True,
'color': (0.0, 0.0, 0.0)}],
[2,
'1.1.1 first step to hello world',
1,
{'kind': 4,
'xref': 41632,
'page': 0,
**'to': Point(0.0, 731.8583),**
'zoom': 0.0,
'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25972',
'collapse': True,
'color': (0.0, 0.0, 0.0)}],
...
]
PyMuPDF version
1.24.1
Operating system
Linux
Python version
3.9
You did not provide the reproducing file.
Sorry, I could not upload the PDF file for some reason.
However, it is pretty clear that the 'to' point in the TOC is based on a bottom-left origin, not a top-left origin.
I simply want to convert the 'to' points to top-left coordinates.
It is not at all clear:
What are we even looking at? Where do the "**" come from?
The TOC entries seem to point to named destinations - are there errors in the PDF? Or in our code?
Did the PDF creator want to point to the bottom left point 🤷♂️?
Have you tried to look at the PDF's names dictionary?
Again: without the file in question we are already wasting time.
Maybe you simply had a question and just wanted to know how to do coordinate transformation?
In that case you shouldn't have submitted an error report but a post in Discussions.
Sorry for the "**" markers; I only wanted to show those lines in bold, and I have already deleted them.
My question is:
get_toc(simple=False) returns Point(0.0, 761.8583) for 1.1 Hello World and Point(0.0, 731.8583) for 1.1.1. first step to hello world.
1.1 Hello World is above 1.1.1. first step to hello world on the page, yet the y value of Point(0.0, 761.8583) is greater than that of Point(0.0, 731.8583), which is not consistent with PyMuPDF's top-left coordinate system.
Ok - to make some progress, I am transferring this thread to Discussions, and we can continue there.
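For reference, the conversion asked about in this thread can be sketched in plain Python. This is only an illustration under stated assumptions: the page is unrotated, its visible area starts at the PDF origin, and its height is 792 pt (US Letter). The thread never gives the real page size, and `pdf_to_top_left` is a hypothetical helper, not part of PyMuPDF. With a real document the height would likely come from `page.rect.height`, and multiplying the point by `page.transformation_matrix` should cover the general case, though neither was checked against the file in question.

```python
def pdf_to_top_left(x, y, page_height):
    """Flip a point from PDF space (origin bottom-left, y grows upward)
    to a top-left-origin space (y grows downward).

    Assumes an unrotated page whose visible area starts at (0, 0).
    """
    return (x, page_height - y)


# ASSUMPTION: 792 pt page height (US Letter); the thread does not state it.
PAGE_HEIGHT = 792.0

# 'to' values copied from the TOC output shown in the issue.
parent = pdf_to_top_left(0.0, 761.8583, PAGE_HEIGHT)  # 1.1 Hello world
child = pdf_to_top_left(0.0, 731.8583, PAGE_HEIGHT)   # 1.1.1 first step ...

# After the flip, the entry that sits higher on the page has the smaller y.
assert parent[1] < child[1]
```

After the flip, the higher-level title gets the smaller y value, which matches the top-left convention the reporter expected.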