Missing true character mappings if there are UTF-8 characters encoded
MaxBorn22 opened this issue · comments
pdf-testfile: Minimal set with UTF-8 characters encoded
ill01.pdf
ill00.pdf
What I expect: true character mappings if there are UTF-8 characters encoded, see at end for details.
See extracted text of ill01.pdf
See extracted text of ill00.pdf and search for terms that include 'ff' or 'ft' or "n's"
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Windows-10-10.0.19045-SP0
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.4, crypt_provider=('cryptography', '41.0.7'), PIL=10.2.0
Code + PDF
from pypdf import PdfReader

def get_text(pdf_bytes: bytes) -> Text:
    with io.BytesIO(pdf_bytes) as pdf_stream:
        reader = PdfReader(pdf_stream)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
    return text
PDF file(s) that cause the issue. See top: ill01.pdf
Traceback
ill01.pdf
content of the pdf-file (seen at end):
/Encoding /Identity-H
/DescendantFonts [147 0 R]
/ToUnicode 148 0 R>>
endobj
What does the "CMap/encoding Identity-H" tell us?
The 2-byte character codes are used directly as CIDs, which typically also serve as glyph indices (GIDs), so there's no need to remap them.
However, without a ToUnicode CMap, PDF viewers can't map glyphs to Unicode values.
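To make the two steps concrete, here is a minimal sketch in plain Python (not pypdf's implementation). It assumes a hypothetical `to_unicode` dictionary standing in for a parsed ToUnicode CMap; the CID values are invented for illustration and are not taken from ill01.pdf.

```python
def identity_h_cids(data: bytes) -> list[int]:
    """Split a string shown with Identity-H into 2-byte big-endian codes (= CIDs)."""
    if len(data) % 2:
        raise ValueError("Identity-H strings must have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def cids_to_text(cids, to_unicode):
    """Map CIDs to text; CIDs without a ToUnicode entry become U+FFFD."""
    return "".join(to_unicode.get(cid, "\ufffd") for cid in cids)


# Hypothetical mapping: one ligature glyph (CID 0x0102) stands for three characters.
to_unicode = {0x0052: "o", 0x0102: "ffi", 0x0046: "c", 0x0048: "e"}
cids = identity_h_cids(bytes.fromhex("0052010200460048"))
print(cids_to_text(cids, to_unicode))  # office
```

Identity-H only gets you as far as the list of CIDs; the readable text comes entirely from the second step, so a wrong or missing ToUnicode entry shows up as a wrong or missing character.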
0000059775 00000 n
0000000192 00000 n
0000000392 00000 n
0000000591 00000 n
0000020809 00000 n
0000006174 00000 n
0000006481 00000 n
0000006776 00000 n
0000021054 00000 n
0000012089 00000 n
0000012374 00000 n
0000012642 00000 n
pdf2json@3.0.5 [https://github.com/modesty/pdf2json]
-------------
json2pdf-log:
Warning: Output file will be replaced - ill01.json
Info: Transcoding File ill01.pdf to - ill01.json
Info: about to load PDF file ill01.pdf
Info: Load OK: ill01.pdf
Warning: Setting up fake worker.
Info: PDF loaded. pagesCount = 1
Info: start to parse page:1
Warning: TT: complementing a missing function tail
Info: Skipped: tiny fill: 0 x 0
Info: Success: Page 1
Info: complete parsing page:1
Info: PDF parsing completed.
Note that both viewers tested, Chromium and Edge, are able to map the UTF-8 characters as given,
pdf.js does not
pypdf does not
Your code will not work out of the box, as there are undefined names. Running
from pypdf import PdfReader

def get_text(path):
    reader = PdfReader(path)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text

print(get_text("ill01.pdf"))
yields
Home ownership
trafficking
leȅ unchecked
those trafficking deadly fentanyl,
and Putin’s war profiteers
Running ill01.pdf through pdftotext (Poppler) instead, we get:
Home ownership
trafficking
le
unchecked
those trafficking deadly fentanyl,
and Putin’s war profiteers
Thus, it seems like this is a more general problem, not only affecting pypdf.
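To pin down exactly which characters an extractor produced (for example the unexpected one after "le" in the pypdf output), a small helper like the following can list every non-ASCII character with its code point and Unicode name. This is only a diagnostic sketch using the standard library; the function name `unusual_characters` is made up for this example.

```python
import unicodedata


def unusual_characters(text: str) -> list[tuple[str, str, str]]:
    """List each distinct non-ASCII character with its code point and Unicode name."""
    seen = []
    for ch in text:
        if ord(ch) > 0x7F and ch not in seen:
            seen.append(ch)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in seen]


# Example input written with escapes so the code points are unambiguous.
for entry in unusual_characters("le\u0205 unchecked, Putin\u2019s war"):
    print(entry)
```

Feeding it the output of `page.extract_text()` and of pdftotext makes it easier to compare the tools character by character instead of by eye.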
I've tried to extract the text using Acrobat, pdf.js and Chrome, and none shows the same results as pypdf. I agree with @stefan6419846's analysis.
Thank you both,
seems that all these popular pdf-extractors are using some version of the incomplete pdfjs-dist-package.
get isIdentityCMap() {
  if (!(this.name === "Identity-H" || this.name === "Identity-V")) {
    return false;
  }

// A special case of CMap, where the _map array implicitly has a length of
// 65536 and each element is equal to its index.
class IdentityCMap extends CMap {
  constructor(vertical, n) {
    super();
    this.vertical = vertical;
    this.addCodespaceRange(n, 0, 0xffff);
  }
"Identity-H" / "Identity-V" is just an internal translation to get characters. You still need ToUnicode CMap.
I propose to close this issue
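For anyone who wants to look at what the ToUnicode CMap mentioned above actually contains, here is a rough Python sketch that reads its `bfchar` and `bfrange` sections. It is not how pypdf or pdf.js parse CMaps. It assumes well-formed hex strings, one `bfrange` entry per line, and it skips the array form of `bfrange`; the sample CMap is invented, not taken from ill01.pdf.

```python
import re

_BFCHAR = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
_BFRANGE = re.compile(r"beginbfrange(.*?)endbfrange", re.S)
_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
_TRIPLE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def _utf16(hex_digits: str) -> str:
    """Destination values in a ToUnicode CMap are UTF-16BE code units."""
    return bytes.fromhex(hex_digits).decode("utf-16-be", errors="replace")


def parse_to_unicode(cmap: bytes) -> dict[int, str]:
    """Build a {character code: text} table from bfchar/bfrange sections."""
    source = cmap.decode("latin-1")
    mapping: dict[int, str] = {}
    for block in _BFCHAR.findall(source):
        for src, dst in _PAIR.findall(block):
            mapping[int(src, 16)] = _utf16(dst)
    for block in _BFRANGE.findall(source):
        for line in block.splitlines():
            match = _TRIPLE.fullmatch(line.strip())
            if match is None:
                continue  # blank line or unsupported array form
            low, high, dst = match.groups()
            start = _utf16(dst)
            for offset in range(int(high, 16) - int(low, 16) + 1):
                # Simplification: increment the last character of the target.
                text = start[:-1] + chr(ord(start[-1]) + offset)
                mapping[int(low, 16) + offset] = text
    return mapping


sample = b"""
2 beginbfchar
<0102> <006600660069>
<0052> <006F>
endbfchar
1 beginbfrange
<0041> <0043> <0061>
endbfrange
"""
table = parse_to_unicode(sample)
print(table[0x0102], table[0x0041], table[0x0043])
```

With a table like this in hand, one can check by hand whether the codes used for the 'ff' or 'ft' glyphs have sensible entries in the file's own CMap.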
seems that all these popular pdf-extractors are using some version of the incomplete pdfjs-dist-package.
This is just wrong. pdf.js is the PDF renderer of Firefox. Poppler is a C++ library for working with PDF files and more or less the default backend for lots of free standalone PDF viewers. Chrome uses pdfium, another library.
You are always invited to clarify your issue or even propose a corresponding PR to fix possible issues within pypdf. For now, it seems like your PDF uses some rather strange edge case that multiple "standard" tools are not able to handle correctly.
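One way to narrow such a report down is to list, per font, which encoding is used and whether a /ToUnicode entry is present. The helper below is a hedged sketch: `describe_fonts` is a made-up name, it works on any mapping shaped like a PDF /Font resource dictionary, and the pypdf usage shown in the comment assumes the page carries its own /Resources entry.

```python
def describe_fonts(fonts):
    """Return (name, encoding, has_to_unicode) for each entry of a /Font dictionary."""
    report = []
    for name, ref in fonts.items():
        # pypdf stores indirect references here; plain dicts are used as-is.
        font = ref.get_object() if hasattr(ref, "get_object") else ref
        report.append((str(name), str(font.get("/Encoding")), "/ToUnicode" in font))
    return report


# Assumed usage with pypdf (not executed here):
#   from pypdf import PdfReader
#   page = PdfReader("ill01.pdf").pages[0]
#   print(describe_fonts(page["/Resources"]["/Font"]))

# Stand-in data with the same shape, for illustration only.
example = {
    "/F1": {"/Encoding": "/Identity-H", "/ToUnicode": object()},
    "/F2": {"/Encoding": "/WinAnsiEncoding"},
}
print(describe_fonts(example))
```

A font that reports Identity-H together with a present /ToUnicode entry points the investigation at the contents of that CMap rather than at the encoding itself.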
Stefan,
Nothing strange is involved here. Had you looked at the metadata of the file provided, you would have seen the producer.
There is a significant difference between pdf.js and [pdfjs-dist, pdf2json]. Only pdf.js handles identity maps correctly.
Nothing strange is involved here. Had you looked at the metadata of the file provided, you would have seen the producer.
I have seen that the producer is marked as "Microsoft Print to PDF" already, but this does not change anything about my previous comments.
There is a significant difference between pdf.js and [pdfjs-dist, pdf2json]. Only pdf.js handles identity maps correctly.
pdf.js and the other projects are in no way related to pypdf - except that they are PDF libraries as well.
I am going to close this issue for now, as pdf.js is mostly out of scope here and I do not see any real progress in the discussion. If you have more insights that are actually correct and not about pdf.js, feel free to drop a comment and we might decide to re-open this issue. Besides this, you are always invited to provide a corresponding fix to pypdf by creating an appropriate PR - this is what FOSS is about.