Simply reading and saving without any operations on the attached PDF file causes data loss

Question

kovidgoyal opened this issue 10 months ago · comments

I'm submitting a:
- Vulnerability report -> use the specific form instead!
- [x] Bug report
- Feature request/suggestion
[Bug report] What is the current behavior?
Reading the PDF file into a PdfMemDocument and then writing it out to a new PDF file reduces the size from ~4MB to 40KB, and the images are lost.
[Bug report] What is the expected behavior?

There is no significant file size reduction and the visual appearance of the exported PDF file is identical to the original.

[Bug report] Please provide the steps to reproduce and if possible a minimal reproduction code of the problem
Call PdfMemDocument::Load(), then call PdfMemDocument::Save().

Note that loading the document reports a warning:
PoDoFoWARNING: Found object with reference 0 0 R different than reported 46 0 R in XRef sections

This probably means the PDF file is corrupted, but it is rendered correctly by every PDF program I tried. PoDoFo should ideally handle the corruption or throw an exception when trying to write instead of silently discarding the data.
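Until the library itself reports this kind of loss, a caller can at least catch it after the fact. The following is a minimal sketch of such a check, not part of PoDoFo: the helper names `shrankSuspiciously`, `fileSize` and `checkRoundTrip` and the 0.5 threshold are assumptions made for illustration. It flags a saved file that is far smaller than its source, as in the ~4MB to 40KB case here.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a PoDoFo API): true when the output is smaller
// than minRatio times the input size. The 0.5 default is an arbitrary guess.
bool shrankSuspiciously(std::uintmax_t inputBytes, std::uintmax_t outputBytes,
                        double minRatio = 0.5) {
    return static_cast<double>(outputBytes) <
           static_cast<double>(inputBytes) * minRatio;
}

// Returns the size of a file in bytes, or throws if it cannot be opened.
std::uintmax_t fileSize(const std::string& path) {
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) {
        throw std::runtime_error("cannot open " + path);
    }
    return static_cast<std::uintmax_t>(file.tellg());
}

// Throws if the saved file is suspiciously smaller than the original.
void checkRoundTrip(const std::string& inputPath, const std::string& outputPath) {
    const std::uintmax_t inSize = fileSize(inputPath);
    const std::uintmax_t outSize = fileSize(outputPath);
    if (shrankSuspiciously(inSize, outSize)) {
        throw std::runtime_error(
            "saved PDF is much smaller than the original: " +
            std::to_string(outSize) + " vs " + std::to_string(inSize) + " bytes");
    }
}
```

A size comparison is only a heuristic: recompression or removal of unused objects can legitimately shrink a file, so the threshold would need tuning per workload.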

Please tell us about your environment:
- Version/git revision: [1.10.0]
- Operating System: [all ]
- Package manager used: [source ]
[Bug report/Feature request] Other information
The problem document:
https://bugs.launchpad.net/calibre/+bug/2035026/+attachment/5699703/+files/Sourcefile.pdf

Kovid Goyal commented 8 months ago

Thanks :)

Francesco Pretto · Answer 1 · Mon Oct 30 2023 06:54:41 GMT+0800 (China Standard Time)

I agree, something very wrong is happening here. I'm investigating.

Francesco Pretto · Answer 2 · Mon Oct 30 2023 16:39:53 GMT+0800 (China Standard Time)

The warning was spurious. I fixed it, but that doesn't fix the main issue.

Francesco Pretto · Answer 3 · Tue Oct 31 2023 00:58:34 GMT+0800 (China Standard Time)

Decryption has been seriously broken for a long time when the chunk of data is big enough (>4096 bytes) and the compiler does not zero-initialize stack variables (which it is definitely allowed not to do). I added the case to the unit tests. Yesterday I rushed a 0.10.2 release before noticing this issue. I consider it critical enough that a 0.10.3 is needed soon (not all bug fixes go into 0.10.x). Let's wait until the end of the week, and if nothing else happens I will release 0.10.3. In the meantime, I would be glad if you could do some testing.
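To illustrate the class of bug described above, here is a minimal sketch. It is not PoDoFo's code and not a real cipher: `ToyStream` and `transformChunked` are hypothetical names, and the XOR keystream stands in for actual decryption. The point is that state carried across 4096-byte chunks must be explicitly initialized, and that a regression test needs input larger than one chunk to exercise that state.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy keystream transform (an assumption for illustration only).
// The running key position is the state that must survive across chunks.
struct ToyStream {
    std::vector<std::uint8_t> key;
    std::size_t pos;

    // pos is explicitly set to 0. Leaving such a variable indeterminate
    // only shows up once a second chunk is processed.
    explicit ToyStream(std::vector<std::uint8_t> k) : key(std::move(k)), pos(0) {}

    void apply(const std::uint8_t* in, std::uint8_t* out, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i) {
            out[i] = in[i] ^ key[pos];
            pos = (pos + 1) % key.size();
        }
    }
};

// Processes data in chunks of chunkSize bytes, carrying the stream state
// from one chunk to the next. Returns the input unchanged for an empty key
// or a zero chunk size.
std::vector<std::uint8_t> transformChunked(const std::vector<std::uint8_t>& data,
                                           const std::vector<std::uint8_t>& key,
                                           std::size_t chunkSize = 4096) {
    if (key.empty() || chunkSize == 0) {
        return data;
    }
    ToyStream stream(key);
    std::vector<std::uint8_t> out(data.size());
    for (std::size_t off = 0; off < data.size(); off += chunkSize) {
        const std::size_t len = std::min(chunkSize, data.size() - off);
        stream.apply(data.data() + off, out.data() + off, len);
    }
    return out;
}
```

A useful regression check for this kind of code compares the chunked result on a buffer larger than 4096 bytes against a single-pass result over the same buffer; the two must be identical.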

Francesco Pretto · Answer 4 · Tue Oct 31 2023 00:59:51 GMT+0800 (China Standard Time)

This is the branch where 0.10.3 will be tagged.

Kovid Goyal · Answer 5 · Tue Oct 31 2023 19:07:38 GMT+0800 (China Standard Time)

Here is another issue: reading the PDF, updating the info dict and the XMP metadata, and saving the PDF takes an extremely long time:
https://bugs.launchpad.net/calibre/+bug/2041745

Probably some quadratic algorithm somewhere. This is with 0.10.1, but I don't see anything in the changelog since then that could be relevant.

Francesco Pretto · Answer 6 · Tue Oct 31 2023 21:38:56 GMT+0800 (China Standard Time)

Please open another ticket. I briefly tested, and it looks like the main culprit for the slowdown is the garbage collection, which can be disabled with PdfSaveOptions::NoCollectGarbage. In general it looks like the document has some inefficiencies, but I can't say for sure.
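For readers unfamiliar with the term, garbage collection at save time typically means dropping indirect objects that are no longer reachable from the document's root. The sketch below shows that idea on a plain reference graph. It is a conceptual illustration under stated assumptions, not PoDoFo's implementation: `ObjectId`, `RefGraph` and `reachableFrom` are hypothetical names, and real PDF objects carry generation numbers and nested references that are ignored here.

```cpp
#include <unordered_map>
#include <unordered_set>
#include <vector>

using ObjectId = unsigned;
// Maps each object to the objects it references (hypothetical model).
using RefGraph = std::unordered_map<ObjectId, std::vector<ObjectId>>;

// Returns the set of objects reachable from root by following references.
// Anything outside this set would be a candidate for removal.
std::unordered_set<ObjectId> reachableFrom(const RefGraph& graph, ObjectId root) {
    std::unordered_set<ObjectId> seen;
    std::vector<ObjectId> stack{root};
    while (!stack.empty()) {
        const ObjectId id = stack.back();
        stack.pop_back();
        if (!seen.insert(id).second) {
            continue;  // already visited
        }
        const auto it = graph.find(id);
        if (it == graph.end()) {
            continue;  // object has no outgoing references recorded
        }
        for (ObjectId next : it->second) {
            stack.push_back(next);
        }
    }
    return seen;
}
```

This traversal visits each object once. Whether the collector in PoDoFo behaves this way, or where its cost on this particular document comes from, is not something the thread establishes.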