PDF tags after converting tags from PDF

Question

Wowhere opened this issue a year ago

I am trying to parse an archived Russian newspaper with the following code:

import sys
from os import listdir
from os.path import isfile, join

import pdftotext
from ftfy import fix_text

# Directory containing the PDF files, passed as the first argument
pdf_dir = sys.argv[1]
pdf_files = [f for f in listdir(pdf_dir) if isfile(join(pdf_dir, f))]

parts = []
for pdf_file in pdf_files:
    # join() works whether or not pdf_dir ends with a path separator
    with open(join(pdf_dir, pdf_file), "rb") as f:
        pdf = pdftotext.PDF(f)
        parts.append(pdf_file + "\n")
        for page in pdf:
            parts.append(page)
        # Page count of the current file
        parts.append(str(len(pdf)) + "\n")

# Write UTF-8 explicitly so Cyrillic text survives on any platform
with open("res.txt", "w", encoding="utf-8") as out:
    out.write(fix_text("".join(parts)))

After parsing, I get the following.

In the PDF, the relevant text is here.

How can I get rid of this strange text?

Jason Alan Palmer · Answer 1 · Sat Aug 05 2023 03:48:10 GMT+0800 (China Standard Time)

Can you provide a link to the PDF in question?

Wowhere · Answer 2 · Thu Aug 10 2023 18:03:47 GMT+0800 (China Standard Time)

Sorry for the delay. The PDF is here: https://drive.google.com/file/d/1zf9gJ18SKwEEJnTl-O-CDfdz09ZOcSq5/view?usp=sharing

Jason Alan Palmer · Answer 3 · Wed Aug 16 2023 05:30:42 GMT+0800 (China Standard Time)

The text that your red arrow is pointing at matches what is in the PDF. Probably the author of the PDF, instead of putting the correct text in the document, has put some text that only displays correctly with some custom font. We have no automatic way of reversing whatever stupid actions the author has taken here.

For comparison, here is how the text displays in google docs as well:

Юрий Лепский 
ЧТО /4689:/; < =><>?  @>A6, =B C=DB: =9E:> 9C F9- <6I9J =D CB?4B: =9E:> 9C =D/  =B ?>FB: LM>CMB<D:N O6A6- IBB.
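
Since the stored text itself is wrong, an extraction script cannot repair it, but it can at least flag the affected pages. The following is a minimal sketch of one heuristic, assuming the documents are mostly Russian: it measures what share of the non-whitespace characters are Cyrillic. The function names and the 0.5 threshold are assumptions made for illustration, not part of pdftotext, and the threshold would need tuning on real pages.

```python
import unicodedata


def cyrillic_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are Cyrillic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cyrillic = sum(1 for c in chars if "CYRILLIC" in unicodedata.name(c, ""))
    return cyrillic / len(chars)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: flag non-empty text whose Cyrillic share is below threshold.

    The threshold is an assumed starting value, not a tested constant.
    """
    if not text.strip():
        return False  # an empty page is not evidence of a broken font mapping
    return cyrillic_ratio(text) < threshold
```

Each `page` yielded by `pdftotext.PDF` is a string, so `looks_garbled(page)` could be called inside the page loop to record which files and pages need manual attention.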

Wowhere · Answer 4 · Wed Aug 16 2023 17:12:30 GMT+0800 (China Standard Time)

How can I set a custom font for pdftotext?

Jason Alan Palmer · Answer 5 · Wed Aug 16 2023 22:40:30 GMT+0800 (China Standard Time)

It sounds like you haven't made the effort to understand the issue.

There is no fix for us to do here. The creator of that PDF has made it so you can't automatically extract the correct text.
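
For readers who hit the same problem: nothing automatic exists, but if the custom font happens to substitute characters one-to-one, a translation table can be built by hand from a few samples where both the extracted text and the text visible on the page are known. The sketch below assumes that one-to-one property, which is not confirmed for this PDF. `build_mapping` is a hypothetical helper, and the sample strings are made up, not taken from the document in this issue.

```python
def build_mapping(garbled: str, correct: str) -> dict[int, str]:
    """Build a str.translate table from two aligned strings.

    Assumes each garbled character always stands for the same correct
    character; raises ValueError if the samples contradict that.
    """
    if len(garbled) != len(correct):
        raise ValueError("samples must align character by character")
    mapping: dict[int, str] = {}
    for g, c in zip(garbled, correct):
        # setdefault keeps the first pairing, so a mismatch means a conflict
        if mapping.setdefault(ord(g), c) != c:
            raise ValueError(f"{g!r} maps to more than one character")
    return mapping


# Made-up aligned sample for illustration only
table = build_mapping("XYZ", "кот")
decoded = "ZYX XYZ".translate(table)
```

Characters missing from the table pass through `translate` unchanged, so the table can be extended sample by sample. If the font maps one code to several glyphs, or relies on ligatures, this approach fails and OCR of the rendered pages is the more realistic route.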