pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Home Page: https://pymupdf.readthedocs.io

How to get glyph and convert to character?

june6423 opened this issue

Description of the bug

I want to extract the characters of a PDF accurately.
To keep things simple, let's discuss only the first line of Table 1.
[Screenshot, 2024-03-28 2:50:16 PM]

First, I tried to extract words using page.get_text("words").
The extracted words, joined with spaces, are 'Dye lmax [nm]a ( 3 /104 M 1 cm 1) lmax [nm]b Eox [V]c E0e0 [eV]d Eox * [V]e'
(corresponding to page.get_text("words")[923:942]).
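
For reference, here is a minimal sketch of that first attempt. It assumes the word tuples have the layout documented for get_text("words"), that is (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper names join_words and load_words are illustrative only, and the page index is left as a parameter because I am not sure how "page 6" maps to a 0-based index.

```python
def join_words(words, start=0, stop=None):
    """Join the text field (index 4) of word tuples into one string."""
    return " ".join(w[4] for w in words[start:stop])


def load_words(pdf_path, page_index):
    """Return the word tuples of one page (requires PyMuPDF)."""
    import fitz  # imported lazily so join_words also works without PyMuPDF

    with fitz.open(pdf_path) as doc:
        return doc[page_index].get_text("words")
```

With the attached file this would be called as join_words(load_words(path, page_index), 923, 942).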

The Greek letter lambda (λ) was replaced by l, and epsilon (ε) by 3.
The minus sign in the superscripts is missing, and the minus sign in the subscript was replaced by e.

Then I tried to extract individual characters using page.get_textpage().extractRAWDICT()['blocks'].
Again, lambda was replaced by l and epsilon by 3.
This time the minus signs were replaced by U+FFFD (�).
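
To locate the affected characters, something like the following can be used. It is only a sketch, assuming the raw dictionary has the documented blocks / lines / spans / chars nesting, with each character stored under the key "c" and each span carrying a "font" name. The function name find_unmapped_chars is mine.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode REPLACEMENT CHARACTER


def find_unmapped_chars(rawdict):
    """Yield (font name, bbox) for every U+FFFD character in a raw dict."""
    for block in rawdict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line.get("spans", []):
                for char in span.get("chars", []):
                    if char["c"] == REPLACEMENT:
                        yield span["font"], tuple(char["bbox"])
```

The input would be page.get_text("rawdict") or the dictionary returned by extractRAWDICT(). The result shows which fonts produce the unmapped characters and where they sit on the page.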

Next, I tried page.get_text("variant").
Lambda was still replaced by l and epsilon by 3.
The minus sign in the superscripts was replaced by '\x01', and the minus sign in the subscript was replaced by e.

Finally, I tried page.get_texttrace().
The result was the same as in the second attempt, except that the character is reported as the code point 65533, and chr(65533) is U+FFFD.
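
As far as I understand the documentation, each span returned by get_texttrace() has a "font" name and a "chars" sequence of (unicode, glyph id, origin, bbox) tuples, so the glyph ids of the unmapped characters can be collected roughly as follows. This is a sketch under that assumption; the function name unmapped_glyphs is mine.

```python
from collections import defaultdict


def unmapped_glyphs(trace):
    """Map font name -> sorted glyph ids whose Unicode value is U+FFFD."""
    found = defaultdict(set)
    for span in trace:
        for char in span["chars"]:
            codepoint, glyph_id = char[0], char[1]
            if codepoint == 0xFFFD:  # 65533, the replacement character
                found[span["font"]].add(glyph_id)
    return {font: sorted(ids) for font, ids in found.items()}
```

Calling unmapped_glyphs(page.get_texttrace()) would then list, per font, the glyph ids that have no usable Unicode value.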

Getting the exact characters out of a PDF seems difficult. In particular, when the g_use_extra option is enabled, the code that produces U+FFFD lives in the compiled extension (.so file), so I could not inspect its source. Since even a simple character like the minus sign (-) fails, I suspect the problem is in the font.

According to this issue, it seems to be possible to restore characters using glyphs even if they are broken.

So I ran page.get_fonts() to get the font information.
It returned 10 fonts:

0:(479, 'cff', 'Type1', 'IHAPDB+AdvOT863180fb', 'F1', 'WinAnsiEncoding')
1:(480, 'cff', 'Type1', 'IHAPEJ+AdvOTb83ee1dd.B', 'F10', 'WinAnsiEncoding')
2:(733, 'cff', 'Type1', 'IHAPJK+AdvPS4721B4', 'F13', 'WinAnsiEncoding')
3:(734, 'cff', 'Type1', 'IHAPJL+AdvP4C4E51', 'F14', 'WinAnsiEncoding')
4:(483, 'cff', 'Type1', 'IHAPDC+AdvOTb92eb7df.I', 'F2', 'WinAnsiEncoding')
5:(484, 'cff', 'Type1', 'IHAPDD+AdvP4C4E59', 'F3', 'WinAnsiEncoding')
6:(486, 'cff', 'Type1', 'IHAPEE+AdvPS44A44B', 'F5', 'WinAnsiEncoding')
7:(487, 'cff', 'Type1', 'IHAPEF+AdvPS3F4C13', 'F6', 'WinAnsiEncoding')
8:(488, 'cff', 'Type1', 'IHAPEG+AdvOT863180fb+fb', 'F7', '')
9:(489, 'cff', 'Type1', 'IHAPEH+AdvP4C4E74', 'F8', '')
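
The tuples above follow the (xref, ext, type, basefont, name, encoding) layout. Here is a small sketch for working with them, assuming that layout. FontInfo, strip_subset_prefix and fonts_without_encoding are illustrative names, not PyMuPDF APIs.

```python
from typing import NamedTuple


class FontInfo(NamedTuple):
    xref: int
    ext: str
    type: str
    basefont: str
    name: str
    encoding: str


def strip_subset_prefix(basefont):
    """Drop a subset tag such as 'IHAPDB+' (six uppercase letters and '+')."""
    tag, plus, rest = basefont.partition("+")
    if plus and len(tag) == 6 and tag.isalpha() and tag.isupper():
        return rest
    return basefont


def fonts_without_encoding(fonts):
    """Return the fonts whose encoding field is empty."""
    return [FontInfo(*f[:6]) for f in fonts if not f[5]]
```

For the list above this selects xrefs 488 and 489 (F7 and F8). Whether these are the fonts behind the broken characters is only a guess. The embedded font program for an xref can be fetched with doc.extract_font(xref) for inspection in a font tool.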

But I have no idea how to get a glyph_id from these fonts.
Googling the font names turns up nothing, and they are not among the standard PDF fonts. I would like to know how to get the font information and its glyphs.

Expectation

I want to get the font information and its glyphs. Ultimately, I want to restore the original characters using the glyphs.
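
To make the goal concrete, this is roughly the kind of repair step I have in mind. It is a sketch only: the correction table keyed by (font name, glyph id) would have to be built by hand, for example by looking at each glyph. The font names and glyph ids in the usage example are hypothetical, and repair_chars is not a PyMuPDF function.

```python
def repair_chars(chars, corrections):
    """Rebuild text from (font, glyph_id, text) triples.

    corrections maps (font, glyph_id) to the character that should be
    used instead of the extracted one; other characters pass through.
    """
    return "".join(
        corrections.get((font, glyph_id), text)
        for font, glyph_id, text in chars
    )
```

For example, with a hypothetical table {("SymFont", 12): "λ", ("SymFont", 7): "−"}, the triples ("SymFont", 12, "l"), ("Body", 80, "m") and ("SymFont", 7, "\ufffd") would be rebuilt as "λm−".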

Environment

print(sys.version,"\n", sys.platform, "\n", fitz.__doc__)

3.10.8 (main, Nov  4 2022, 13:48:29) [GCC 11.2.0]  

linux 

PyMuPDF 1.24.0: Python bindings for the MuPDF 1.24.0 library (rebased implementation).

Python 3.10 running on linux (64-bit).

How to reproduce the bug

Here is my PDF. I am working on page 6 (the page with Table 1).
DyesandPigments2014102196_ZhuWong.pdf

PyMuPDF version

1.24.0

Operating system

Linux

Python version

3.10

This is not a bug - everything seems to work as designed.
I am going to convert this to a Discussions item.