pdfcpu / pdfcpu

A PDF processor written in Go.

Home Page: http://pdfcpu.io/

Order of attachments

vsenko opened this issue

It looks like v0.7.0 sorts attachments alphanumerically by file name (ID). Since attached files technically have an order within a PDF, this is confusing. For example, the text of a PDF could reference attached files not only by name but also by their position.

Code to reproduce:

package main

import (
	"bytes"
	"fmt"
	"strings"

	pdfcpuapi "github.com/pdfcpu/pdfcpu/pkg/api"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
	pdfcpucreate "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/create"
	pdfcpumodel "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/model"
	pdfcputypes "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/types"
)

func main() {
	ctx, err := pdfcpu.CreateContextWithXRefTable(pdfcpumodel.NewDefaultConfiguration(), pdfcputypes.PaperSize["A4"])
	if err != nil {
		panic(err)
	}

	template := `{"pages": { "1": { "content": { "text": [{ "value": "page 1", "anchor": "left", "font": { "name": "Helvetica", "size": 12 } }] } } } }`
	err = pdfcpucreate.FromJSON(ctx, strings.NewReader(template))
	if err != nil {
		panic(err)
	}

	err = ctx.AddAttachment(pdfcpumodel.Attachment{Reader: strings.NewReader("a"), ID: "a", Desc: "a"}, false)
	if err != nil {
		panic(err)
	}

	err = ctx.AddAttachment(pdfcpumodel.Attachment{Reader: strings.NewReader("1"), ID: "1", Desc: "1"}, false)
	if err != nil {
		panic(err)
	}

	err = ctx.AddAttachment(pdfcpumodel.Attachment{Reader: strings.NewReader("z"), ID: "z", Desc: "z"}, false)
	if err != nil {
		panic(err)
	}

	err = ctx.AddAttachment(pdfcpumodel.Attachment{Reader: strings.NewReader("d"), ID: "d", Desc: "d"}, false)
	if err != nil {
		panic(err)
	}

	var b bytes.Buffer
	err = pdfcpuapi.WriteContext(ctx, &b)
	if err != nil {
		panic(err)
	}

	attachments, err := pdfcpuapi.ExtractAttachmentsRaw(bytes.NewReader(b.Bytes()), "", nil, nil)
	if err != nil {
		panic(err)
	}

	for _, a := range attachments {
		fmt.Println(a.ID)
	}
}

Expected output:

a
1
z
d

Actual output:

1
a
d
z

That's because attachments are stored in a PDF EmbeddedFiles nametree.

As far as I understand, the elements of EmbeddedFiles have an order. It would therefore be convenient if attached files were placed there in the same order in which they were added.

Name tree keys are sorted in lexical order.
I am not inclined to hack the order into the key.

After some research I now understand that Attachment.ID is not the file name but the string that identifies the embedded file in EmbeddedFiles. What confused me is that Attachment.FileName gets lost when attaching. Here is code to illustrate it:

package main

import (
	"bytes"
	"fmt"
	"os"
	"strings"

	pdfcpuapi "github.com/pdfcpu/pdfcpu/pkg/api"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
	pdfcpucreate "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/create"
	pdfcpumodel "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/model"
	pdfcputypes "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/types"
)

func main() {
	ctx, err := pdfcpu.CreateContextWithXRefTable(pdfcpumodel.NewDefaultConfiguration(), pdfcputypes.PaperSize["A4"])
	if err != nil {
		panic(err)
	}

	template := `{"pages": { "1": { "content": { "text": [{ "value": "page 1", "anchor": "left", "font": { "name": "Helvetica", "size": 12 } }] } } } }`
	err = pdfcpucreate.FromJSON(ctx, strings.NewReader(template))
	if err != nil {
		panic(err)
	}

	attachments := []pdfcpumodel.Attachment{
		{Reader: strings.NewReader("afile"), ID: "id1", FileName: "a.txt", Desc: "a-decs"},
		{Reader: strings.NewReader("1file"), ID: "id2", FileName: "1.txt", Desc: "1-decs"},
		{Reader: strings.NewReader("zfile"), ID: "id3", FileName: "z.txt", Desc: "z-decs"},
		{Reader: strings.NewReader("dfile"), ID: "id4", FileName: "d.txt", Desc: "d-decs"},
	}

	for _, a := range attachments {
		err = ctx.AddAttachment(a, false)
		if err != nil {
			panic(err)
		}
	}

	var b bytes.Buffer
	err = pdfcpuapi.WriteContext(ctx, &b)
	if err != nil {
		panic(err)
	}

	attachments, err := pdfcpuapi.ExtractAttachmentsRaw(bytes.NewReader(b.Bytes()), "", nil, nil)
	if err != nil {
		panic(err)
	}

	for _, a := range attachments {
		fmt.Println(a.ID, a.FileName, a.Desc)
	}
}

Expected output is:

id1 a.txt a-decs
id2 1.txt 1-decs
id3 z.txt z-decs
id4 d.txt d-decs

But the actual one is:

id1 id1 a-decs
id2 id2 1-decs
id3 id3 z-decs
id4 id4 d-decs

In fact, if you inspect the generated PDF, /F and /UF contain id1, not the file name. This happens here: https://github.com/pdfcpu/pdfcpu/blob/master/pkg/pdfcpu/model/attach.go#L122

return xRefTable.NewFileSpecDict(a.ID, a.ID, a.Desc, *sd)

a.ID gets passed as /F and /UF.
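One possible direction, sketched here only as an assumption and not as the actual fix, would be to prefer the file name for /F and /UF and fall back to the ID when no file name is set. The `attachment` struct below is a hypothetical stand-in that mirrors just the fields used in this thread, and `fileSpecName` is a hypothetical helper, not a pdfcpu function.

```go
package main

import "fmt"

// attachment is a hypothetical stand-in for the Attachment fields
// used in this thread (ID, FileName, Desc).
type attachment struct {
	ID       string
	FileName string
	Desc     string
}

// fileSpecName returns the value one might store in /F and /UF:
// the file name when present, otherwise the ID.
// Hypothetical helper for illustration only.
func fileSpecName(a attachment) string {
	if a.FileName != "" {
		return a.FileName
	}
	return a.ID
}

func main() {
	withName := attachment{ID: "id1", FileName: "a.txt", Desc: "a-decs"}
	withoutName := attachment{ID: "id2", Desc: "1-decs"}

	fmt.Println(withName.ID, fileSpecName(withName))       // id1 a.txt
	fmt.Println(withoutName.ID, fileSpecName(withoutName)) // id2 id2
}
```

With a fallback like this, callers who only set the ID would see the same behavior as before, while callers who set a file name would get it back on extraction.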

So this issue is actually not about the order of attachments, but about attachment file names being lost.

Thanks, I'll take a look.