Missing true character mappings if there are UTF-8 characters encoded

Question

MaxBorn22 opened this issue 7 months ago

pdf-testfile: Minimal set with UTF-8 characters encoded
ill01.pdf
ill00.pdf

What I expect: true character mappings if there are UTF-8 characters encoded, see at end for details.
See extracted text of ill01.pdf
See extracted text of ill00.pdf and search for terms that include 'ff' or 'ft' or "n's"

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19045-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.4, crypt_provider=('cryptography', '41.0.7'), PIL=10.2.0

Code + PDF

from pypdf import PdfReader
def get_text(pdf_bytes: bytes) -> Text:
    with io.BytesIO(pdf_bytes) as pdf_stream:
        reader = PdfReader(pdf_stream)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
    return text

PDF file(s) that cause the issue. See top: ill01.pdf

Traceback

ill01.pdf
content of the pdf-file (seen at end):


/Encoding /Identity-H
/DescendantFonts [147 0 R]
/ToUnicode 148 0 R>>
endobj

What does the "CMap/encoding Identity-H" tell us?
The character codes (CIDs) are the same as the glyph indices (GIDs), so there's no need to remap them.

However, without a ToUnicode CMap, PDF viewers can't map glyphs to Unicode values.
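To make the Identity-H point concrete, here is a minimal sketch of how a string shown with such a font can be split into codes. It assumes the usual reading of Identity-H, where every character code is two bytes, high-order byte first, and maps to the CID with the same value. The function name and the sample bytes are hypothetical and not taken from ill01.pdf.

```python
from typing import List


def identity_h_cids(data: bytes) -> List[int]:
    """Split a string shown with an Identity-H font into 2-byte codes.

    Assumption: each code is two bytes, big-endian, and equals its CID.
    """
    if len(data) % 2:
        raise ValueError("Identity-H strings must have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Hypothetical content-stream string <002B0052>, not taken from ill01.pdf.
codes = identity_h_cids(bytes.fromhex("002b0052"))
print(codes)  # [43, 82]
```

The resulting numbers only select glyphs. Nothing in them says which Unicode characters the glyphs stand for, which is why the ToUnicode entry matters for text extraction.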

ill01.pdf

0000059775 00000 n 
0000000192 00000 n 
0000000392 00000 n 
0000000591 00000 n 
0000020809 00000 n 
0000006174 00000 n 
0000006481 00000 n 
0000006776 00000 n 
0000021054 00000 n 
0000012089 00000 n 
0000012374 00000 n 
0000012642 00000 n

pdf2json@3.0.5 [https://github.com/modesty/pdf2json]
-------------
json2pdf-log:
Warning: Output file will be replaced - ill01.json
Info: Transcoding File ill01.pdf to - ill01.json
Info: about to load PDF file ill01.pdf
Info: Load OK: ill01.pdf
Warning: Setting up fake worker.
Info: PDF loaded. pagesCount = 1
Info: start to parse page:1
Warning: TT: complementing a missing function tail
Info: Skipped: tiny fill: 0 x 0
Info: Success: Page 1
Info: complete parsing page:1
Info: PDF parsing completed.

Note that both viewers tested, Chromium and Edge, are able to map the UTF-8 characters as given,
pdf.js does not
pypdf does not

Stefan · Answer 1 · Fri Mar 01 2024 18:05:12 GMT+0800 (China Standard Time)

Your code will not work out of the box, as there are undefined names. Running

from pypdf import PdfReader


def get_text(path):
    reader = PdfReader(path)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text

print(get_text("ill01.pdf"))

yields

Home ownership 
traﬀicking 
leȅ  unchecked 
those traﬀicking deadly fentanyl,   
and Putin’s war profiteers

Running ill01.pdf through pdftotext (Poppler) instead, we get:

Home ownership
traﬀicking
le

unchecked

those traﬀicking deadly fentanyl,
and Putin’s war profiteers

Thus, it seems like this is a more general problem, not only affecting pypdf.
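If the goal is only to search the extracted text for words such as "trafficking", one possible workaround (a sketch, not something pypdf does for you) is to fold ligature characters after extraction with Unicode NFKC normalization from the standard library. The helper name fold_ligatures is hypothetical.

```python
import unicodedata


def fold_ligatures(text: str) -> str:
    """Apply NFKC normalization, which expands ligatures such as U+FB00 to 'ff'.

    Caveat: NFKC is lossy. It also rewrites other compatibility characters
    (for example superscripts and full-width forms).
    """
    return unicodedata.normalize("NFKC", text)


print(fold_ligatures("tra\ufb00icking"))  # trafficking
```

This only helps where the extracted character really is a ligature code point. A glyph that was mapped to an unrelated character, as in the "le" line above, stays wrong after normalization.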

pubpub-zz · Answer 2 · Fri Mar 01 2024 18:13:16 GMT+0800 (China Standard Time)

I've tried to extract the text using Acrobat/Pdf.JS and Chrome and none shows the same results as pypdf. I agree with @stefan6419846's analysis.

MaxBorn22 · Answer 3 · Sat Mar 02 2024 00:47:26 GMT+0800 (China Standard Time)

Thank You both,

seems that all these popular pdf-extractors are using some version of the incomplete pdfjs-dist-package.

get isIdentityCMap() {
    if (!(this.name === "Identity-H" || this.name === "Identity-V")) {
      return false;
    }

// A special case of CMap, where the _map array implicitly has a length of
// 65536 and each element is equal to its index.
class IdentityCMap extends CMap {
  constructor(vertical, n) {
    super();

    this.vertical = vertical;
    this.addCodespaceRange(n, 0, 0xffff);
  }

pdf.js/src/core/cmap.js

pubpub-zz · Answer 4 · Sat Mar 02 2024 01:00:16 GMT+0800 (China Standard Time)

"Identity-H" / "Identity-V" is just an internal translation to get characters. You still need ToUnicode CMap.
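As an illustration of why the ToUnicode CMap is still needed, the sketch below parses the bfchar entries of a ToUnicode stream and uses them to turn codes into text. The CMap fragment is made up for the example and is not taken from ill01.pdf. The names parse_bfchar and decode are hypothetical, and bfrange sections are deliberately not handled.

```python
import re
from typing import Dict, Iterable

# Hypothetical ToUnicode fragment; destinations are UTF-16BE hex strings.
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0020>
<0057> <00660066>
endbfchar
"""


def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Collect code -> Unicode string pairs from all bfchar sections."""
    mapping: Dict[int, str] = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode(codes: Iterable[int], mapping: Dict[int, str]) -> str:
    """Translate codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_bfchar(SAMPLE_CMAP)
print(repr(decode([0x57, 0x03], mapping)))  # 'ff '
```

In this made-up fragment a single glyph code maps to the two characters "ff". If a producer instead writes a wrong or incomplete destination for a ligature glyph, every extractor that trusts the ToUnicode data will report the same wrong text.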
I propose to close this issue

Stefan · Answer 5 · Sat Mar 02 2024 01:08:28 GMT+0800 (China Standard Time)

seems that all these popular pdf-extractors are using some version of the incomplete pdfjs-dist-package.

This is just wrong. pdf.js is the PDF renderer of Firefox. Poppler is a C++ library for working with PDF files and more or less the default backend for lots of free standalone PDF viewers. Chrome uses pdfium, another library.

You are always invited to clarify your issue or even propose a corresponding PR to fix possible issues within pypdf. For now, it seems like your PDF uses some rather strange edge case multiple "standard" tools are not able to handle correctly.

MaxBorn22 · Answer 6 · Sat Mar 02 2024 16:39:24 GMT+0800 (China Standard Time)

Stefan,
no strange things involved here: if you had taken a look inside the metadata of the file provided, you would have seen the producer.
There is a significant difference between pdf.js and [pdfjs-dist, pdf2json]. Only pdf.js handles identity maps correctly.

Stefan · Answer 7 · Sat Mar 02 2024 17:02:05 GMT+0800 (China Standard Time)

no strange things involved here: if you had taken a look inside the metadata of the file provided, you would have seen the producer.

I have seen that the producer is marked as "Microsoft Print to PDF" already, but this does not change anything about my previous comments.

There is a significant difference between pdf.js and [pdfjs-dist, pdf2json]. Only pdf.js handles identity maps correctly.

pdf.js and the other projects are in no way related to pypdf - except that they are PDF libraries as well.

I am going to close this issue for now, as pdf.js is mostly out of scope here and I do not see any real progress in the discussion. If you have further insights which are actually correct and not about pdf.js, feel free to drop a comment and we might decide to re-open this issue. Besides this, you are always invited to provide a corresponding fix to pypdf by creating an appropriate PR; this is what FOSS is about.