Hindi dots not extracted when extracting text from PDF
aleem75321 opened this issue · comments
Description of the bug
When I extracted text from a PDF containing Marathi text, the Hindi diacritic dots did not appear in some of the text.
In the extracted output below, some characters are correct, but in place of the dots (bindu) there are digits, symbols, and control characters.
How to reproduce the bug
test.pdf
Sample file attached.
I have also attached the code and some of its output.
# Code
import fitz

doc = fitz.open("test_pages/02032024_MTM_MP_0002_1_COL_R1.pdf")
page = doc[0]
page.clean_contents()
black = fitz.pdfcolor["red"]  # unused below; the name does not match the color
blocks = page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
for b in blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            # Only print the large spans in one specific color
            if s["size"] > 32 and s["color"] == 2236191:
                for c in s["chars"]:
                    print(c)
# Output
{'origin': (28.978105545043945, 128.46484375), 'bbox': (28.978105545043945, 67.26271057128906, 61.44647979736328, 162.04791259765625), 'c': 'घ'}
{'origin': (61.44648361206055, 128.46484375), 'bbox': (61.44648361206055, 67.26271057128906, 81.60342407226562, 162.04791259765625), 'c': 'र'}
{'origin': (81.60343170166016, 128.46484375), 'bbox': (81.60343170166016, 67.26271057128906, 95.48396301269531, 162.04791259765625), 'c': ' '}
{'origin': (94.88040924072266, 128.46484375), 'bbox': (94.88040924072266, 67.26271057128906, 139.47915649414062, 162.04791259765625), 'c': 'ख'}
{'origin': (139.47915649414062, 128.46484375), 'bbox': (139.47915649414062, 67.26271057128906, 159.6361083984375, 162.04791259765625), 'c': 'र'}
{'origin': (159.6361083984375, 128.46484375), 'bbox': (159.6361083984375, 67.26271057128906, 159.6361083984375, 162.04791259765625), 'c': '+'}
{'origin': (159.6361083984375, 128.46484375), 'bbox': (159.6361083984375, 67.26271057128906, 186.4315643310547, 162.04791259765625), 'c': '4'}
{'origin': (186.43157958984375, 128.46484375), 'bbox': (186.43157958984375, 67.26271057128906, 200.19140625, 162.04791259765625), 'c': '-'}
{'origin': (200.19142150878906, 128.46484375), 'bbox': (200.19142150878906, 67.26271057128906, 226.98687744140625, 162.04791259765625), 'c': '4'}
{'origin': (226.98687744140625, 128.46484375), 'bbox': (226.98687744140625, 67.26271057128906, 240.7467041015625, 162.04791259765625), 'c': '\x07'}
{'origin': (240.74671936035156, 128.46484375), 'bbox': (240.74671936035156, 67.26271057128906, 260.9036560058594, 162.04791259765625), 'c': 'र'}
{'origin': (260.9036560058594, 128.46484375), 'bbox': (260.9036560058594, 67.26271057128906, 274.6634826660156, 162.04791259765625), 'c': '\x07'}
{'origin': (274.66351318359375, 128.46484375), 'bbox': (274.66351318359375, 67.26271057128906, 274.66351318359375, 162.04791259765625), 'c': '\x03'}
{'origin': (274.66351318359375, 128.46484375), 'bbox': (274.66351318359375, 67.26271057128906, 309.6665954589844, 162.04791259765625), 'c': '\x08'}
{'origin': (309.66656494140625, 128.46484375), 'bbox': (309.66656494140625, 67.26271057128906, 309.66656494140625, 162.04791259765625), 'c': '+'}
{'origin': (309.66656494140625, 128.46484375), 'bbox': (309.66656494140625, 67.26271057128906, 325.28662109375, 162.04791259765625), 'c': ' '}
{'origin': (325.28662109375, 128.46484375), 'bbox': (325.28662109375, 67.26271057128906, 339.04644775390625, 162.04791259765625), 'c': '\r'}
{'origin': (339.04644775390625, 128.46484375), 'bbox': (339.04644775390625, 67.26271057128906, 365.2384033203125, 162.04791259765625), 'c': '\x1e'}
{'origin': (365.2384338378906, 128.46484375), 'bbox': (365.2384338378906, 67.26271057128906, 394.4479064941406, 162.04791259765625), 'c': 'त'}
{'origin': (394.4479064941406, 128.46484375), 'bbox': (394.4479064941406, 67.26271057128906, 414.6048583984375, 162.04791259765625), 'c': 'र'}
{'origin': (414.6048583984375, 128.46484375), 'bbox': (414.6048583984375, 67.26271057128906, 451.65985107421875, 162.04791259765625), 'c': 'क'}
{'origin': (451.65985107421875, 128.46484375), 'bbox': (451.65985107421875, 67.26271057128906, 451.65985107421875, 162.04791259765625), 'c': '्'}
{'origin': (451.65985107421875, 128.46484375), 'bbox': (451.65985107421875, 67.26271057128906, 451.65985107421875, 162.04791259765625), 'c': 'ष'}
{'origin': (451.65985107421875, 128.46484375), 'bbox': (451.65985107421875, 67.26271057128906, 491.3702392578125, 162.04791259765625), 'c': 'ण'}
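The wrong characters in the output above are ASCII symbols, digits, and control characters sitting inside an otherwise Devanagari span. Below is a minimal sketch of how such characters might be flagged automatically. It assumes each entry has the `'c'` key shown in the `rawdict` output. The helper names `is_devanagari` and `suspicious_chars` are hypothetical and not part of PyMuPDF, and the rule is only a heuristic, since digits and punctuation can legitimately appear in Hindi or Marathi text.

```python
import unicodedata


def is_devanagari(ch):
    # Devanagari Unicode block: U+0900 .. U+097F
    return "\u0900" <= ch <= "\u097f"


def suspicious_chars(chars):
    """Return (index, character) pairs that look like failed glyph mappings.

    Heuristic: control characters are always flagged; non-space ASCII
    characters are flagged only if the span also contains Devanagari.
    """
    texts = [entry["c"] for entry in chars]
    has_devanagari = any(is_devanagari(t) for t in texts)
    flagged = []
    for i, t in enumerate(texts):
        is_control = unicodedata.category(t) == "Cc"
        is_ascii_symbol = (
            has_devanagari and t.isascii() and not t.isspace() and not is_control
        )
        if is_control or is_ascii_symbol:
            flagged.append((i, t))
    return flagged


# A few characters taken from the output above (only the 'c' values).
sample = [{"c": c} for c in ["ख", "र", "+", "4", "-", "4", "\x07", "र"]]
flagged = suspicious_chars(sample)
print(flagged)
```

Running this over every span of a page would give a quick indication of whether an extraction is affected, without reading the character dump by eye.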
PyMuPDF version
1.24.0
Operating system
Windows
Python version
3.11
Cannot reproduce. Taking your code (properly indented), I am getting this:
import fitz

doc = fitz.open("test.pdf")
page = doc[0]
page.clean_contents()
blocks = page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
for b in blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            if s["size"] > 32 and s["color"] == 2236191:
                for c in s["chars"]:
                    print(c)
{'origin': (61.44648361206055, 128.46484375), 'bbox': (61.44648361206055, 67.26271057128906, 81.60342407226562, 162.04791259765625), 'c': 'र'}
{'origin': (81.60343170166016, 128.46484375), 'bbox': (81.60343170166016, 67.26271057128906, 95.48396301269531, 162.04791259765625), 'c': ' '}
{'origin': (94.88040924072266, 128.46484375), 'bbox': (94.88040924072266, 67.26271057128906, 139.47915649414062, 162.04791259765625), 'c': 'ख'}
{'origin': (139.47915649414062, 128.46484375), 'bbox': (139.47915649414062, 67.26271057128906, 159.6361083984375, 162.04791259765625), 'c': 'र'}
{'origin': (159.6361083984375, 128.46484375), 'bbox': (159.6361083984375, 67.26271057128906, 159.6361083984375, 162.04791259765625), 'c': 'े'}
{'origin': (159.6361083984375, 128.46484375), 'bbox': (159.6361083984375, 67.26271057128906, 186.4315643310547, 162.04791259765625), 'c': 'द'}
{'origin': (186.43157958984375, 128.46484375), 'bbox': (186.43157958984375, 67.26271057128906, 200.19140625, 162.04791259765625), 'c': 'ी'}
{'origin': (200.19142150878906, 128.46484375), 'bbox': (200.19142150878906, 67.26271057128906, 226.98687744140625, 162.04791259765625), 'c': 'द'}
{'origin': (226.98687744140625, 128.46484375), 'bbox': (226.98687744140625, 67.26271057128906, 240.7467041015625, 162.04791259765625), 'c': 'ा'}
{'origin': (240.74671936035156, 128.46484375), 'bbox': (240.74671936035156, 67.26271057128906, 260.9036560058594, 162.04791259765625), 'c': 'र'}
{'origin': (260.9036560058594, 128.46484375), 'bbox': (260.9036560058594, 67.26271057128906, 274.6634826660156, 162.04791259765625), 'c': 'ा'}
{'origin': (274.66351318359375, 128.46484375), 'bbox': (274.66351318359375, 67.26271057128906, 274.66351318359375, 162.04791259765625), 'c': 'ं'}
{'origin': (274.66351318359375, 128.46484375), 'bbox': (274.66351318359375, 67.26271057128906, 309.6665954589844, 162.04791259765625), 'c': 'च'}
{'origin': (309.66656494140625, 128.46484375), 'bbox': (309.66656494140625, 67.26271057128906, 309.66656494140625, 162.04791259765625), 'c': 'े'}
{'origin': (322.943603515625, 128.46484375), 'bbox': (322.943603515625, 67.26271057128906, 336.70343017578125, 162.04791259765625), 'c': 'ह'}
{'origin': (336.70343017578125, 128.46484375), 'bbox': (336.70343017578125, 67.26271057128906, 362.8953857421875, 162.04791259765625), 'c': 'ि'}
{'origin': (362.8954162597656, 128.46484375), 'bbox': (362.8954162597656, 67.26271057128906, 392.1048889160156, 162.04791259765625), 'c': 'त'}
{'origin': (392.1048889160156, 128.46484375), 'bbox': (392.1048889160156, 67.26271057128906, 412.2618408203125, 162.04791259765625), 'c': 'र'}
{'origin': (412.2618408203125, 128.46484375), 'bbox': (412.2618408203125, 67.26271057128906, 449.31683349609375, 162.04791259765625), 'c': 'क'}
{'origin': (449.31683349609375, 128.46484375), 'bbox': (449.31683349609375, 67.26271057128906, 449.31683349609375, 162.04791259765625), 'c': '्'}
{'origin': (449.31683349609375, 128.46484375), 'bbox': (449.31683349609375, 67.26271057128906, 449.31683349609375, 162.04791259765625), 'c': 'ष'}
{'origin': (449.31683349609375, 128.46484375), 'bbox': (449.31683349609375, 67.26271057128906, 489.0272216796875, 162.04791259765625), 'c': 'ण'}
Hi, I am using the same code, but my output is different. Below are my version details:
PyMuPDF 1.23.21: Python bindings for the MuPDF 1.23.9 library (rebased implementation). Python 3.11 running on win32 (64-bit).
I also tried on Linux and got the same result.
So, in contrast to what you stated in the bug submission, you were not using the current version, but an earlier one!
The behavior with 1.24.0 is indeed different.
Thank you so much. The issue was resolved after updating.
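Since the cause turned out to be a mismatch between the reported and the installed version, a small check before filing a report can help. The sketch below parses a version banner like the one quoted earlier in this thread and compares it with 1.24.0 as a tuple of integers. The helper name `parse_pymupdf_version` is hypothetical, and the banner is hard-coded here. In a real session it would presumably come from the installed package itself.

```python
import re


def parse_pymupdf_version(banner):
    """Extract (major, minor, patch) from a PyMuPDF version banner string."""
    match = re.search(r"PyMuPDF (\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no PyMuPDF version found in banner")
    return tuple(int(part) for part in match.groups())


# Banner text as quoted in this thread.
banner = (
    "PyMuPDF 1.23.21: Python bindings for the MuPDF 1.23.9 library "
    "(rebased implementation)."
)
installed = parse_pymupdf_version(banner)

# Compare as integer tuples, not strings, so that e.g. 1.9.x sorts before 1.24.x.
needs_update = installed < (1, 24, 0)
print(installed, needs_update)
```

Comparing integer tuples avoids the pitfalls of comparing version strings character by character.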