podofo / podofo

A C++17 PDF manipulation library

Home Page: https://podofo.github.io/podofo/documentation


Simply reading and saving without any operations on the attached PDF file causes data loss

kovidgoyal opened this issue · comments

  • I'm submitting a:

    • Vulnerability report -> use the specific form instead!
    • [x] Bug report
    • Feature request/suggestion
  • [Bug report] What is the current behavior?
    Reading the PDF file into a PdfMemDocument and then writing it out to a new PDF file reduces the size from ~4MB to 40KB and the images are lost.

  • [Bug report] What is the expected behavior?

There is no significant file size reduction and the visual appearance of the exported PDF file is identical to the original.

  • [Bug report] Please provide the steps to reproduce and if possible a minimal reproduction code of the problem
    Call PdfMemDocument::Load() then call PdfMemDocument::Save()

Note that loading the document reports a warning:
PoDoFoWARNING: Found object with reference 0 0 R different than reported 46 0 R in XRef sections

This probably means the PDF file is corrupted, but it is rendered correctly by every PDF program I tried. PoDoFo should ideally handle the corruption or throw an exception when trying to write instead of silently discarding the data.
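For reference, the load-then-save reproduction can be wrapped in a small check that flags this kind of silent shrinkage. This is only a sketch: `sizeRatio` and `roundTripSizeRatio` are hypothetical helper names, and the template assumes a document type whose `Load()` and `Save()` accept a file path, which is how I understand `PdfMemDocument` to work in 0.10.x.

```cpp
#include <cstdint>
#include <filesystem>
#include <string>

// Hypothetical helper: output size divided by input size.
// Returns 1.0 for an empty input, since there is nothing to compare.
double sizeRatio(std::uintmax_t inputBytes, std::uintmax_t outputBytes) {
    if (inputBytes == 0) {
        return 1.0;
    }
    return static_cast<double>(outputBytes) / static_cast<double>(inputBytes);
}

// Hypothetical helper: load a document, save it untouched, and report how
// much of the original size survived. Doc is assumed to expose
// Load(path) and Save(path), e.g. PoDoFo::PdfMemDocument (assumption).
template <typename Doc>
double roundTripSizeRatio(Doc& doc, const std::string& inputPath,
                          const std::string& outputPath) {
    doc.Load(inputPath);
    doc.Save(outputPath);
    return sizeRatio(std::filesystem::file_size(inputPath),
                     std::filesystem::file_size(outputPath));
}
```

With PoDoFo this would presumably be called as `PoDoFo::PdfMemDocument doc; roundTripSizeRatio(doc, "input.pdf", "output.pdf");`. For the attached file the ratio comes out at roughly 0.01 (about 40KB out of about 4MB), where a value close to 1 is expected.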

I agree, something very wrong is happening here. I'm investigating.

The warning was spurious. I fixed it but it still doesn't fix the main issue.

Thanks :)

Decryption has been seriously broken for a long time when a chunk of data is big enough (>4096 bytes) and the compiler does not zero-initialize stack variables (which it is definitely allowed not to do). I added the case to the unit tests. Yesterday I rushed a 0.10.2 release before noticing this issue. I consider it critical enough that a 0.10.3 is needed soon (not all bug fixes go to 0.10.x). Let's wait until the end of the week, and if nothing else happens I will release 0.10.3. In the meantime, if you can do some testing I would be glad.
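To illustrate the general class of bug (this is not the actual PoDoFo code, which is not shown in this thread), here is a minimal sketch of processing data larger than a 4096-byte stack buffer. A toy XOR transform stands in for the real cipher, and `xorInChunks` is a hypothetical name. The two points that matter are that the stack buffer is explicitly value-initialized instead of relying on the compiler, and that only the bytes written in each pass are appended to the output.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

// Toy stand-in for a cipher: XOR every byte with a key, 4096 bytes at a time.
std::string xorInChunks(const std::string& input, unsigned char key) {
    constexpr std::size_t kChunkSize = 4096;

    // Value-initialized, so every byte starts as zero on any compiler.
    // A plain `unsigned char buffer[4096];` would hold indeterminate data.
    std::array<unsigned char, kChunkSize> buffer{};

    std::string output;
    output.reserve(input.size());

    for (std::size_t offset = 0; offset < input.size(); offset += kChunkSize) {
        // The last chunk may be shorter than the buffer.
        const std::size_t count = std::min(kChunkSize, input.size() - offset);

        for (std::size_t i = 0; i < count; ++i) {
            buffer[i] = static_cast<unsigned char>(
                static_cast<unsigned char>(input[offset + i]) ^ key);
        }

        // Append only the bytes written in this pass, never the whole buffer.
        output.append(reinterpret_cast<const char*>(buffer.data()), count);
    }
    return output;
}
```

Because XOR is its own inverse, applying the function twice with the same key should give back the input, which makes inputs longer than 4096 bytes easy to cover in a unit test.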

This is the branch where 0.10.3 will be tagged.

Here is another issue: reading the PDF, updating the info dict and the XMP metadata, and saving the PDF takes extremely long:
https://bugs.launchpad.net/calibre/+bug/2041745

Probably some quadratic algorithm somewhere. This is with 0.10.1, but I don't see anything in the changelog since then that could be relevant.

Please open another ticket. I briefly tested and it looks like the main culprit for the slowdown is the garbage collection, which can be disabled with PdfSaveOptions::NoCollectGarbage. In general it looks like the document has some inefficiencies, but I can't say for sure.