PDF tags after converting tags from PDF
Wowhere opened this issue · comments
I am trying to parse an archived Russian newspaper with the following code:
from os import listdir
from os.path import isfile, join
import sys

import pdftotext
from ftfy import fix_text

# Directory containing the PDF files, passed as the first argument
l = sys.argv[1]
onlyfiles = [f for f in listdir(l) if isfile(join(l, f))]
the_text = ""
for pdf_file in onlyfiles:
    # join() works whether or not the directory argument ends with a separator
    with open(join(l, pdf_file), "rb") as f:
        pdf = pdftotext.PDF(f)
        the_text += pdf_file + '\n'
        #print(pdf_file)
        for page in pdf:
            the_text += str(page)
            #print(page)
        the_text += str(len(pdf)) + '\n'
        #print(len(pdf))
# Write the result as UTF-8 so the Cyrillic text is saved on any platform
with open('res.txt', 'w', encoding='utf-8') as a:
    a.write(fix_text(the_text))
and after parsing I get the following.
How can I get rid of this strange text?
Can you provide a link to the PDF in question?
Sorry for the delay. The PDF is here: https://drive.google.com/file/d/1zf9gJ18SKwEEJnTl-O-CDfdz09ZOcSq5/view?usp=sharing
The text that your red arrow is pointing at matches what is in the PDF. Probably the author of the PDF, instead of putting the correct text in the document, has put some text that only displays correctly with some custom font. We have no automatic way of reversing whatever stupid actions the author has taken here.
For comparison, here is how the same text displays in Google Docs:
Юрий Лепский
ЧТО /4689:/; < =><>? @>A6, =B C=DB: =9E:> 9C F9- <6I9J =D CB?4B: =9E:> 9C =D/ =B ?>FB: LM>CMB<D:N O6A6- IBB.
How can I set a custom font for pdftotext?
It sounds like you haven't made the effort to understand the issue.
There is no fix for us to do here. The creator of that PDF has made it so you can't automatically extract the correct text.
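Since the extraction itself cannot be repaired, one thing a script can still do is notice when a page came out like this. The following is a minimal sketch, not part of pdftotext: the function names `cyrillic_ratio` and `looks_garbled` and the 0.5 threshold are assumptions made for illustration. It relies on the document being expected to be Russian, so a page whose letters are mostly non-Cyrillic is probably affected by the problem described above.

```python
import unicodedata


def cyrillic_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters in `text` that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("CYRILLIC")
    )
    return cyrillic / len(letters)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: flag text expected to be Russian whose letters are mostly not Cyrillic.

    The threshold is an assumed value and would need tuning on real pages.
    """
    return cyrillic_ratio(text) < threshold


# The two lines quoted earlier in this thread
GOOD = "Юрий Лепский"
BAD = (
    "ЧТО /4689:/; < =><>? @>A6, =B C=DB: =9E:> 9C F9- <6I9J =D CB?4B: "
    "=9E:> 9C =D/ =B ?>FB: LM>CMB<D:N O6A6- IBB."
)

if __name__ == "__main__":
    for sample in (GOOD, BAD):
        print(looks_garbled(sample), sample[:30])
```

In the loop from the original script, this could be called as `looks_garbled(str(page))` to list the affected pages. It only detects the problem; recovering the real text from those pages would need a different approach, such as OCR on the rendered page.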