pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Home Page: https://pymupdf.readthedocs.io

Hindi dots not extracted when extracting text from PDF

aleem75321 opened this issue

Description of the bug

When I extract text from a PDF containing Marathi text, the dot diacritics of the Devanagari script do not appear in some of the extracted text.

In the output shown below, some characters are extracted correctly, but in place of the dots there are digits and other unexpected characters.

test.pdf

How to reproduce the bug

test.pdf
Sample file attached.

The code I used and its output are below.

#Code
import fitz

doc = fitz.open("test_pages/02032024_MTM_MP_0002_1_COL_R1.pdf")
page = doc[0]
page.clean_contents()
blocks = page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
for b in blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            # only large spans in one specific color (0x221F1F)
            if s["size"] > 32 and s["color"] == 2236191:
                for c in s["chars"]:
                    print(c)

#output
{'origin': (28.978105545043945, 128.46484375), 'bbox': (28.978105545043945, 67.26271057128906, 61.44647979736328, 162.04791259765625), 'c': 'घ'}
{'origin': (61.44648361206055, 128.46484375), 'bbox': (61.44648361206055, 67.26271057128906, 81.60342407226562, 162.04791259765625), 'c': 'र'}
{'origin': (81.60343170166016, 128.46484375), 'bbox': (81.60343170166016, 67.26271057128906, 95.48396301269531, 162.04791259765625), 'c': ' '}
{'origin': (94.88040924072266, 128.46484375), 'bbox': (94.88040924072266, 67.26271057128906, 139.47915649414062, 162.04791259765625), 'c': 'ख'}
{'origin': (139.47915649414062, 128.46484375), 'bbox': (139.47915649414062, 67.26271057128906, 159.6361083984375, 162.04791259765625), 'c': 'र'}
{'origin': (159.6361083984375, 128.46484375), 'bbox': (159.6361083984375, 67.26271057128906, 159.6361083984375, 162.04791259765625), 'c': '+'}
{'origin': (159.6361083984375, 128.46484375), 'bbox': (159.6361083984375, 67.26271057128906, 186.4315643310547, 162.04791259765625), 'c': '4'}
{'origin': (186.43157958984375, 128.46484375), 'bbox': (186.43157958984375, 67.26271057128906, 200.19140625, 162.04791259765625), 'c': '-'}
{'origin': (200.19142150878906, 128.46484375), 'bbox': (200.19142150878906, 67.26271057128906, 226.98687744140625, 162.04791259765625), 'c': '4'}
{'origin': (226.98687744140625, 128.46484375), 'bbox': (226.98687744140625, 67.26271057128906, 240.7467041015625, 162.04791259765625), 'c': '\x07'}
{'origin': (240.74671936035156, 128.46484375), 'bbox': (240.74671936035156, 67.26271057128906, 260.9036560058594, 162.04791259765625), 'c': 'र'}
{'origin': (260.9036560058594, 128.46484375), 'bbox': (260.9036560058594, 67.26271057128906, 274.6634826660156, 162.04791259765625), 'c': '\x07'}
{'origin': (274.66351318359375, 128.46484375), 'bbox': (274.66351318359375, 67.26271057128906, 274.66351318359375, 162.04791259765625), 'c': '\x03'}
{'origin': (274.66351318359375, 128.46484375), 'bbox': (274.66351318359375, 67.26271057128906, 309.6665954589844, 162.04791259765625), 'c': '\x08'}
{'origin': (309.66656494140625, 128.46484375), 'bbox': (309.66656494140625, 67.26271057128906, 309.66656494140625, 162.04791259765625), 'c': '+'}
{'origin': (309.66656494140625, 128.46484375), 'bbox': (309.66656494140625, 67.26271057128906, 325.28662109375, 162.04791259765625), 'c': ' '}
{'origin': (325.28662109375, 128.46484375), 'bbox': (325.28662109375, 67.26271057128906, 339.04644775390625, 162.04791259765625), 'c': '\r'}
{'origin': (339.04644775390625, 128.46484375), 'bbox': (339.04644775390625, 67.26271057128906, 365.2384033203125, 162.04791259765625), 'c': '\x1e'}
{'origin': (365.2384338378906, 128.46484375), 'bbox': (365.2384338378906, 67.26271057128906, 394.4479064941406, 162.04791259765625), 'c': 'त'}
{'origin': (394.4479064941406, 128.46484375), 'bbox': (394.4479064941406, 67.26271057128906, 414.6048583984375, 162.04791259765625), 'c': 'र'}
{'origin': (414.6048583984375, 128.46484375), 'bbox': (414.6048583984375, 67.26271057128906, 451.65985107421875, 162.04791259765625), 'c': 'क'}
{'origin': (451.65985107421875, 128.46484375), 'bbox': (451.65985107421875, 67.26271057128906, 451.65985107421875, 162.04791259765625), 'c': '्'}
{'origin': (451.65985107421875, 128.46484375), 'bbox': (451.65985107421875, 67.26271057128906, 451.65985107421875, 162.04791259765625), 'c': 'ष'}
{'origin': (451.65985107421875, 128.46484375), 'bbox': (451.65985107421875, 67.26271057128906, 491.3702392578125, 162.04791259765625), 'c': 'ण'}
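
The garbled entries in the output above can also be spotted programmatically instead of by eye. The following is only a minimal sketch, assuming char dictionaries with a 'c' key as printed above. The helper names `is_expected` and `suspicious_chars` are hypothetical and not part of PyMuPDF, and treating everything outside the Devanagari Unicode block (U+0900 to U+097F) except a plain space as suspicious is an illustrative assumption that would need adjusting for mixed-script documents.

```python
# Hypothetical helpers (not part of PyMuPDF): flag characters that are
# unlikely to belong to Devanagari text, such as control characters,
# ASCII digits, or punctuation.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def is_expected(ch: str) -> bool:
    """Return True for a space or a single code point in the Devanagari block."""
    if ch == " ":
        return True
    return len(ch) == 1 and DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END


def suspicious_chars(chars):
    """Return (index, repr) pairs for entries whose 'c' value looks wrong."""
    return [(i, repr(c["c"])) for i, c in enumerate(chars) if not is_expected(c["c"])]


# A few 'c' values taken from the output above (origin/bbox omitted for brevity).
sample = [{"c": "घ"}, {"c": "र"}, {"c": " "}, {"c": "ख"},
          {"c": "+"}, {"c": "4"}, {"c": "\x07"}]

for index, shown in suspicious_chars(sample):
    print(index, shown)
```

In the reproduction script, the same check could be applied to `s["chars"]` for each span to list only the characters that were extracted incorrectly.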

PyMuPDF version

1.24.0

Operating system

Windows

Python version

3.11

Cannot reproduce. Taking your code (properly indented), I am getting this:

import fitz

doc = fitz.open("test.pdf")
page = doc[0]
page.clean_contents()
blocks = page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
for b in blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            if s["size"] > 32 and s["color"] == 2236191:
                for c in s["chars"]:
                    print(c)

#output
{'origin': (61.44648361206055, 128.46484375), 'bbox': (61.44648361206055, 67.26271057128906, 81.60342407226562, 162.04791259765625), 'c': ''}
{'origin': (81.60343170166016, 128.46484375), 'bbox': (81.60343170166016, 67.26271057128906, 95.48396301269531, 162.04791259765625), 'c': ' '}
{'origin': (94.88040924072266, 128.46484375), 'bbox': (94.88040924072266, 67.26271057128906, 139.47915649414062, 162.04791259765625), 'c': ''}
{'origin': (139.47915649414062, 128.46484375), 'bbox': (139.47915649414062, 67.26271057128906, 159.6361083984375, 162.04791259765625), 'c': ''}
{'origin': (159.6361083984375, 128.46484375), 'bbox': (159.6361083984375, 67.26271057128906, 159.6361083984375, 162.04791259765625), 'c': ''}
{'origin': (159.6361083984375, 128.46484375), 'bbox': (159.6361083984375, 67.26271057128906, 186.4315643310547, 162.04791259765625), 'c': ''}
{'origin': (186.43157958984375, 128.46484375), 'bbox': (186.43157958984375, 67.26271057128906, 200.19140625, 162.04791259765625), 'c': ''}
{'origin': (200.19142150878906, 128.46484375), 'bbox': (200.19142150878906, 67.26271057128906, 226.98687744140625, 162.04791259765625), 'c': ''}
{'origin': (226.98687744140625, 128.46484375), 'bbox': (226.98687744140625, 67.26271057128906, 240.7467041015625, 162.04791259765625), 'c': ''}
{'origin': (240.74671936035156, 128.46484375), 'bbox': (240.74671936035156, 67.26271057128906, 260.9036560058594, 162.04791259765625), 'c': ''}
{'origin': (260.9036560058594, 128.46484375), 'bbox': (260.9036560058594, 67.26271057128906, 274.6634826660156, 162.04791259765625), 'c': ''}
{'origin': (274.66351318359375, 128.46484375), 'bbox': (274.66351318359375, 67.26271057128906, 274.66351318359375, 162.04791259765625), 'c': ''}
{'origin': (274.66351318359375, 128.46484375), 'bbox': (274.66351318359375, 67.26271057128906, 309.6665954589844, 162.04791259765625), 'c': ''}
{'origin': (309.66656494140625, 128.46484375), 'bbox': (309.66656494140625, 67.26271057128906, 309.66656494140625, 162.04791259765625), 'c': ''}
{'origin': (322.943603515625, 128.46484375), 'bbox': (322.943603515625, 67.26271057128906, 336.70343017578125, 162.04791259765625), 'c': ''}
{'origin': (336.70343017578125, 128.46484375), 'bbox': (336.70343017578125, 67.26271057128906, 362.8953857421875, 162.04791259765625), 'c': 'ि'}
{'origin': (362.8954162597656, 128.46484375), 'bbox': (362.8954162597656, 67.26271057128906, 392.1048889160156, 162.04791259765625), 'c': ''}
{'origin': (392.1048889160156, 128.46484375), 'bbox': (392.1048889160156, 67.26271057128906, 412.2618408203125, 162.04791259765625), 'c': ''}
{'origin': (412.2618408203125, 128.46484375), 'bbox': (412.2618408203125, 67.26271057128906, 449.31683349609375, 162.04791259765625), 'c': ''}
{'origin': (449.31683349609375, 128.46484375), 'bbox': (449.31683349609375, 67.26271057128906, 449.31683349609375, 162.04791259765625), 'c': ''}
{'origin': (449.31683349609375, 128.46484375), 'bbox': (449.31683349609375, 67.26271057128906, 449.31683349609375, 162.04791259765625), 'c': ''}
{'origin': (449.31683349609375, 128.46484375), 'bbox': (449.31683349609375, 67.26271057128906, 489.0272216796875, 162.04791259765625), 'c': ''}

Hi, I am using the same code, but my output is different. My version details are below.

PyMuPDF version: PyMuPDF 1.23.21: Python bindings for the MuPDF 1.23.9 library (rebased implementation).
Python 3.11 running on win32 (64-bit).

I also tried on Linux and got the same result.

So, contrary to what you stated in the bug submission, you were not using the current version but an earlier one!
The behavior using 1.24.0 is indeed different.
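
A version mismatch like this can be caught before filing a report by checking what is actually installed in the running environment. The sketch below is one possible way to do that using only the standard library. The helper names `parse_version` and `is_at_least` are hypothetical, the threshold 1.24.0 is taken from this thread, and the parser assumes the first three numeric groups of the version string are major, minor, and patch.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '1.23.21' into (1, 23, 21)."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def is_at_least(installed: str, required: str) -> bool:
    """Compare two dotted version strings numerically, not alphabetically."""
    return parse_version(installed) >= parse_version(required)


try:
    # "PyMuPDF" is the distribution name used on PyPI.
    installed = version("PyMuPDF")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("PyMuPDF is not installed in this environment")
elif is_at_least(installed, "1.24.0"):
    print(f"PyMuPDF {installed} is 1.24.0 or newer")
else:
    print(f"PyMuPDF {installed} is older than 1.24.0; consider upgrading")
```

If the check reports an older version, upgrading with `pip install --upgrade pymupdf` in the same environment and rerunning the script is the usual next step.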

Thank you so much. The issue was resolved after updating.