Text is not replaced in PDFs generated by Word

Question


vbocan opened this issue 2 years ago · comments

I have recreated the existing example.pdf file with Word, i.e. I created a simple Word file with the text "Hello World!" and then exported it as a PDF file. When I try to replace the text in this PDF (as per the provided example), nothing happens. It looks like the "Hello World!" text is not even there.

What would be a proper way to replace the text in a Word-generated PDF?

Ahmed Alaa · Answer 1 · Sun Jan 15 2023 13:46:51 GMT+0800 (China Standard Time)

I have recreated the existing example.pdf file with Word, i.e. I created a simple Word file with the text "Hello World!" and then exported it as a PDF file. When I try to replace the text in this PDF (as per the provided example), nothing happens. It looks like the "Hello World!" text is not even there.

This isn't a bug in lopdf or in Microsoft Word. The text isn't replaced because of how Word writes text into the PDF (which is standard practice as far as I can tell). The PDF standard is very complicated, especially when it comes to text rendering. To save space, Word embeds a font subset (only the parts of the font you are using), which means the text is not written as a literal string (as used in the example) but as a hex string. For example:
literal string: (AB)
hex string: <4142>

However, a PDF producer can map hex values to any glyph it sees fit, so <01> could just as well render as an A. The only way to tell which glyph a hex code refers to is to examine the font's CMap (the ToUnicode entry, if it is present). The CMap basically maps every hex code in the font to its Unicode value, something like:
<01> <41>
The CMap is what makes copying text from a PDF work properly. If you have ever copied text from a PDF and pasted something different from what you selected, that's because the CMap was missing or corrupt.
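To make the mapping concrete, here is a minimal sketch in plain Rust (no lopdf calls) that reads a hex string and the entries of a `beginbfchar` section into a lookup table. The function names `parse_hex_string` and `parse_bfchar` are made up for this illustration. It assumes one `<code> <unicode>` entry per line and a destination of at most two bytes (ToUnicode destinations are UTF-16BE, so `A` is normally written `<0041>`); real CMaps can also contain `bfrange` sections and longer destinations, which this ignores.

```rust
use std::collections::HashMap;

/// Parse a PDF hex string such as "<4142>" into raw bytes.
/// Whitespace inside the brackets is ignored, and an odd number of
/// digits is padded with a trailing 0, as the PDF format specifies.
fn parse_hex_string(s: &str) -> Option<Vec<u8>> {
    let inner = s.trim().strip_prefix('<')?.strip_suffix('>')?;
    let mut digits: Vec<u32> = Vec::new();
    for c in inner.chars().filter(|c| !c.is_whitespace()) {
        digits.push(c.to_digit(16)?);
    }
    if digits.len() % 2 == 1 {
        digits.push(0);
    }
    Some(digits.chunks(2).map(|p| (p[0] * 16 + p[1]) as u8).collect())
}

/// Collect "<code> <unicode>" entries into a code -> char table.
/// Lines that are not two hex strings (e.g. "2 beginbfchar") are skipped.
fn parse_bfchar(section: &str) -> HashMap<Vec<u8>, char> {
    let mut map = HashMap::new();
    for line in section.lines() {
        let mut parts = line.split_whitespace();
        if let (Some(src), Some(dst)) = (parts.next(), parts.next()) {
            if let (Some(code), Some(uni)) = (parse_hex_string(src), parse_hex_string(dst)) {
                // Assumption: the destination is a single UTF-16 unit (1 or 2 bytes).
                if uni.is_empty() || uni.len() > 2 {
                    continue;
                }
                let value = uni.iter().fold(0u32, |acc, b| (acc << 8) | u32::from(*b));
                if let Some(ch) = char::from_u32(value) {
                    map.insert(code, ch);
                }
            }
        }
    }
    map
}
```

With a table like this, the glyph code `<01>` can be looked up to find that it stands for `A`, even though the byte 0x01 has nothing to do with the letter A in any ordinary text encoding.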

What would be a proper way to replace the text in a Word-generated PDF?

Unfortunately, it's not a simple matter. You first need to find the text object in the content stream, determine which font object it uses, find the CMap for that font object, and parse the CMap to learn what each hex code refers to. Only then can you read the text and replace the hex codes with the ones you want.
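The last part of that process (read the text, replace it, write the codes back) can be sketched roughly as follows. This is plain Rust with hypothetical helper names, not lopdf API. It assumes single-byte glyph codes and that the code-to-character table has already been extracted from the font's CMap. It also assumes the whole string sits in one text operand, while real content streams often split a word across several operands with kerning adjustments in between.

```rust
use std::collections::HashMap;

/// Decode glyph codes into text. Returns None if a code has no CMap entry.
fn decode(codes: &[u8], to_unicode: &HashMap<u8, char>) -> Option<String> {
    codes.iter().map(|c| to_unicode.get(c).copied()).collect()
}

/// Encode text back into glyph codes. Returns Err with the first
/// character that has no glyph code in the table.
fn encode(text: &str, to_unicode: &HashMap<u8, char>) -> Result<Vec<u8>, char> {
    // Invert the table. If two codes map to the same character,
    // whichever one ends up in the reverse table is used.
    let reverse: HashMap<char, u8> =
        to_unicode.iter().map(|(code, ch)| (*ch, *code)).collect();
    text.chars().map(|ch| reverse.get(&ch).copied().ok_or(ch)).collect()
}

/// Decode, replace `from` with `to`, and re-encode as glyph codes.
fn replace_text(
    codes: &[u8],
    from: &str,
    to: &str,
    to_unicode: &HashMap<u8, char>,
) -> Result<Vec<u8>, String> {
    let text = decode(codes, to_unicode)
        .ok_or_else(|| String::from("glyph code without a CMap entry"))?;
    let replaced = text.replace(from, to);
    encode(&replaced, to_unicode)
        .map_err(|ch| format!("no glyph for {:?} in the embedded font", ch))
}

/// Format glyph codes as a PDF hex string, e.g. [0x01, 0x04] -> "<0104>".
fn to_hex_string(codes: &[u8]) -> String {
    let body: String = codes.iter().map(|b| format!("{:02X}", b)).collect();
    format!("<{}>", body)
}
```

Note that `encode` fails as soon as the replacement text needs a character the table does not contain.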

But that's not all: the glyph you want to use may not be in the embedded font. In that case you will need to embed a new font that contains the glyph before you can use it.
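A quick way to find out whether you are in that situation is to check the replacement text against the characters the CMap knows about before touching the content stream. The sketch below is plain Rust with a hypothetical function name, `missing_glyphs`, and it assumes single-byte glyph codes. It also uses the CMap as a stand-in for the font subset's contents, which holds only if the CMap lists every embedded glyph.

```rust
use std::collections::{BTreeSet, HashMap};

/// Return the characters of `text` that have no entry in the
/// code -> char table, i.e. the ones the embedded subset likely lacks.
fn missing_glyphs(text: &str, to_unicode: &HashMap<u8, char>) -> BTreeSet<char> {
    let available: BTreeSet<char> = to_unicode.values().copied().collect();
    text.chars().filter(|c| !available.contains(c)).collect()
}
```

An empty result means the replacement can be written with the existing font. Anything else is the set of characters for which a new font would have to be embedded.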

This library's goal isn't to be a full PDF editor but a PDF parser: it takes in a PDF file and gives you a Rust struct to work with. All the editing you have to do yourself, which means you need to understand how a PDF works.