pdfcpu / pdfcpu

A PDF processor written in Go.

Home Page: http://pdfcpu.io/

Dereference error with a particular (corrupt?) PDF

cjpartridgeb opened this issue

I've recently been benchmarking pdfcpu and other PDF tools as part of modernizing some of our PDF processes, and ran across this bug with pdfcpu (Ghostscript and others seem to process the file fine):

dereferenceAndLoad: problem dereferencing object 380: pdfcpu: pdfFilterPipeline: expected decodeParms array corrupt

Here's the full output with -vv:

```
<<<
<X0, (380 0 R)>

READ: 2024/04/25 12:54:48 logStream: no ObjectStreamDict
READ: 2024/04/25 12:54:48 dereferenceObject: begin, dereferencing object 380
READ: 2024/04/25 12:54:48 in use object 380
READ: 2024/04/25 12:54:48 dereferenceAndLoad: dereferencing object 380
READ: 2024/04/25 12:54:48 ParseObject: begin, obj#380, offset:913452
READ: 2024/04/25 12:54:48 newPositionedReader: positioned to offset: 913452
READ: 2024/04/25 12:54:48 buffer: endInd=-1 streamInd=168
READ: 2024/04/25 12:54:48 object: big stream, we parse object until stream
READ: 2024/04/25 12:54:48 pdfFilterPipeline: begin
READ: 2024/04/25 12:54:48 dereferencedObject: dereferencing object 382
READ: 2024/04/25 12:54:48 ParseObject: begin, obj#382, offset:1236490
READ: 2024/04/25 12:54:48 newPositionedReader: positioned to offset: 1236490
READ: 2024/04/25 12:54:48 object: small obj w/o stream, parse until endobj
Fatal: pdfcpu: pdfFilterPipeline: expected decodeParms array corrupt
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.pdfFilterPipeline
```

  • Tested on the latest commit as of yesterday, and also with the version 0.8.0 build that was released a few hours ago
  • All of my testing is on Linux, 64bit, various distros
  • I can't provide the source PDF, due to confidentiality reasons

I have managed to download the source and had a tinker with building a fix, which no longer throws an error when parsing this particular dictionary's contents fails. That change surfaced another error later in the pipeline, which we fixed the same way: by not throwing an error when the dictionary is not available.

This seems to work fine: the custom-built binary now processes the document without error, and the output PDF appears correct.
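For readers unfamiliar with the entry involved: in the PDF format a stream's /DecodeParms value may be a single dictionary or an array with one entry per filter, and entries may be null. The idea of tolerating a malformed value instead of failing can be sketched roughly as below. This is only an illustration under stated assumptions, not pdfcpu's code and not the patch described above: `Dict` and `normalizeDecodeParms` are hypothetical names, and the input is assumed to be already dereferenced.

```go
package main

import "fmt"

// Dict is a hypothetical stand-in for a parsed PDF dictionary.
// It is NOT pdfcpu's own type.
type Dict map[string]any

// normalizeDecodeParms is a hypothetical helper. It returns one parameter
// dictionary per filter (nil means "use the filter's defaults") plus a list
// of warnings. raw is the already-dereferenced /DecodeParms value:
// nil, a Dict, or a []any. Malformed input is reported, never fatal.
func normalizeDecodeParms(filters []string, raw any) ([]Dict, []string) {
	parms := make([]Dict, len(filters))
	var warnings []string

	switch v := raw.(type) {
	case nil:
		// No /DecodeParms entry: every filter runs with its defaults.
	case Dict:
		// A single dictionary is only unambiguous for a single filter.
		if len(filters) == 1 {
			parms[0] = v
		} else {
			warnings = append(warnings,
				"DecodeParms is a dictionary but Filter is an array; ignoring it")
		}
	case []any:
		if len(v) != len(filters) {
			warnings = append(warnings, fmt.Sprintf(
				"DecodeParms has %d entries for %d filters", len(v), len(filters)))
		}
		for i := range parms {
			if i >= len(v) {
				break
			}
			switch e := v[i].(type) {
			case nil:
				// A null entry means defaults for this filter.
			case Dict:
				parms[i] = e
			default:
				warnings = append(warnings, fmt.Sprintf(
					"DecodeParms entry %d has unexpected type %T; ignoring it", i, e))
			}
		}
	default:
		warnings = append(warnings, fmt.Sprintf(
			"DecodeParms has unexpected type %T; ignoring it", raw))
	}
	return parms, warnings
}

func main() {
	filters := []string{"ASCII85Decode", "FlateDecode"}
	parms, warnings := normalizeDecodeParms(filters, Dict{"Predictor": 12})
	fmt.Println(parms, warnings)
}
```

One caveat with this kind of leniency: silently dropping parameters that a filter actually needs (for example a /Predictor setting for FlateDecode) can produce wrongly decoded data, so collecting warnings rather than discarding the problem entirely seems the safer design.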

I will shortly submit a PR with the changes I've made, but please note that I'm a Go newbie and not sure whether my changes have any other ramifications.

Please submit a test file along with your patch.
Thank you!