Order of attachments

Question

vsenko opened this issue 3 months ago

It looks like v0.7.0 sorts attachments alphanumerically by file name (ID). Since attached files technically have an order within the PDF, this is confusing. For example, the PDF text could reference attached files not only by name but also by their position.

Code to reproduce:

package main

import (
	"bytes"
	"fmt"
	"strings"

	pdfcpuapi "github.com/pdfcpu/pdfcpu/pkg/api"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
	pdfcpucreate "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/create"
	pdfcpumodel "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/model"
	pdfcputypes "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/types"
)

func main() {
	ctx, err := pdfcpu.CreateContextWithXRefTable(pdfcpumodel.NewDefaultConfiguration(), pdfcputypes.PaperSize["A4"])
	if err != nil {
		panic(err)
	}

	template := `{"pages": { "1": { "content": { "text": [{ "value": "page 1", "anchor": "left", "font": { "name": "Helvetica", "size": 12 } }] } } } }`
	err = pdfcpucreate.FromJSON(ctx, strings.NewReader(template))
	if err != nil {
		panic(err)
	}

	err = ctx.AddAttachment(pdfcpumodel.Attachment{Reader: strings.NewReader("a"), ID: "a", Desc: "a"}, false)
	if err != nil {
		panic(err)
	}

	err = ctx.AddAttachment(pdfcpumodel.Attachment{Reader: strings.NewReader("1"), ID: "1", Desc: "1"}, false)
	if err != nil {
		panic(err)
	}

	err = ctx.AddAttachment(pdfcpumodel.Attachment{Reader: strings.NewReader("z"), ID: "z", Desc: "z"}, false)
	if err != nil {
		panic(err)
	}

	err = ctx.AddAttachment(pdfcpumodel.Attachment{Reader: strings.NewReader("d"), ID: "d", Desc: "d"}, false)
	if err != nil {
		panic(err)
	}

	var b bytes.Buffer
	err = pdfcpuapi.WriteContext(ctx, &b)
	if err != nil {
		panic(err)
	}

	// Read the attachments back from the written PDF and print their IDs.
	attachments, err := pdfcpuapi.ExtractAttachmentsRaw(bytes.NewReader(b.Bytes()), "", nil, nil)
	if err != nil {
		panic(err)
	}

	for _, a := range attachments {
		fmt.Println(a.ID)
	}
}

Expected output:

a
1
z
d

Actual output:

1
a
d
z

Horst Rutter · Answer 1 · Wed Mar 06 2024 17:35:48 GMT+0800 (China Standard Time)

That's because attachments are stored in a PDF EmbeddedFiles nametree.
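
A name tree keeps its keys in sorted order, so enumerating the entries returns them sorted regardless of insertion order. The following is a minimal sketch of that effect using only the standard library; `sortedKeys` is a hypothetical helper for illustration and is not part of pdfcpu.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns a sorted copy of ids, mimicking the order in which
// a name tree hands its keys back on enumeration.
// Hypothetical helper; not a pdfcpu function.
func sortedKeys(ids []string) []string {
	keys := append([]string(nil), ids...)
	sort.Strings(keys)
	return keys
}

func main() {
	// Same IDs, in the same order, as in the reproduction above.
	added := []string{"a", "1", "z", "d"}
	fmt.Println(sortedKeys(added)) // [1 a d z]
}
```

The digit "1" sorts before the letters because the comparison is byte-wise, which matches the actual output reported above.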

vsenko · Answer 2 · Wed Mar 06 2024 18:59:39 GMT+0800 (China Standard Time)

As far as I understand, the elements of EmbeddedFiles have an order. It would therefore be convenient if attached files were placed there in the same order in which they were added.

Horst Rutter · Answer 3 · Wed Mar 06 2024 19:09:48 GMT+0800 (China Standard Time)

Nametree keys are sorted in lexical order.
I am not inclined to hack the order into the key.
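
For callers who need insertion order anyway, one possible workaround on the caller's side is to encode the position in the ID before attaching. This is only a sketch under that assumption: `orderedID` and the `%03d-` prefix format are hypothetical, not something pdfcpu provides, and the prefix becomes part of the ID that readers of the PDF see.

```go
package main

import (
	"fmt"
	"sort"
)

// orderedID prefixes id with a zero-padded position so that lexical
// order matches insertion order (for up to 999 attachments).
// Hypothetical helper; not a pdfcpu function.
func orderedID(pos int, id string) string {
	return fmt.Sprintf("%03d-%s", pos, id)
}

func main() {
	added := []string{"a", "1", "z", "d"}

	keys := make([]string, len(added))
	for i, id := range added {
		keys[i] = orderedID(i+1, id)
	}

	// Sorting stands in for the name tree's key ordering.
	sort.Strings(keys)
	fmt.Println(keys) // [001-a 002-1 003-z 004-d]
}
```

Zero padding matters here: without it, "10-x" would sort before "2-y".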

vsenko · Answer 4 · Wed Mar 06 2024 22:04:48 GMT+0800 (China Standard Time)

After some research I now understand that Attachment.ID is not the file name, but the string that identifies the embedded file in EmbeddedFiles. What has been confusing me is that Attachment.FileName gets lost when attaching. Here is code to illustrate it:

package main

import (
	"bytes"
	"fmt"
	"strings"

	pdfcpuapi "github.com/pdfcpu/pdfcpu/pkg/api"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
	pdfcpucreate "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/create"
	pdfcpumodel "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/model"
	pdfcputypes "github.com/pdfcpu/pdfcpu/pkg/pdfcpu/types"
)

func main() {
	ctx, err := pdfcpu.CreateContextWithXRefTable(pdfcpumodel.NewDefaultConfiguration(), pdfcputypes.PaperSize["A4"])
	if err != nil {
		panic(err)
	}

	template := `{"pages": { "1": { "content": { "text": [{ "value": "page 1", "anchor": "left", "font": { "name": "Helvetica", "size": 12 } }] } } } }`
	err = pdfcpucreate.FromJSON(ctx, strings.NewReader(template))
	if err != nil {
		panic(err)
	}

	attachments := []pdfcpumodel.Attachment{
		{Reader: strings.NewReader("afile"), ID: "id1", FileName: "a.txt", Desc: "a-decs"},
		{Reader: strings.NewReader("1file"), ID: "id2", FileName: "1.txt", Desc: "1-decs"},
		{Reader: strings.NewReader("zfile"), ID: "id3", FileName: "z.txt", Desc: "z-decs"},
		{Reader: strings.NewReader("dfile"), ID: "id4", FileName: "d.txt", Desc: "d-decs"},
	}

	for _, a := range attachments {
		err = ctx.AddAttachment(a, false)
		if err != nil {
			panic(err)
		}
	}

	var b bytes.Buffer
	err = pdfcpuapi.WriteContext(ctx, &b)
	if err != nil {
		panic(err)
	}

	// Read the attachments back and print ID, file name, and description.
	extracted, err := pdfcpuapi.ExtractAttachmentsRaw(bytes.NewReader(b.Bytes()), "", nil, nil)
	if err != nil {
		panic(err)
	}

	for _, a := range extracted {
		fmt.Println(a.ID, a.FileName, a.Desc)
	}
}

Expected output is:

id1 a.txt a-decs
id2 1.txt 1-decs
id3 z.txt z-decs
id4 d.txt d-decs

But the actual one is:

id1 id1 a-decs
id2 id2 1-decs
id3 id3 z-decs
id4 id4 d-decs

In fact, if you inspect the constructed PDF, /F and /UF contain id1, not the file name. This happens here: https://github.com/pdfcpu/pdfcpu/blob/master/pkg/pdfcpu/model/attach.go#L122

return xRefTable.NewFileSpecDict(a.ID, a.ID, a.Desc, *sd)

a.ID gets passed as /F and /UF.

So this issue is actually not about the order of attachments, but about losing the attachments' file names.
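
One way the name selection could work is to prefer FileName for /F and /UF and fall back to ID when no file name is set. The sketch below only illustrates that idea: the `attachment` struct and `fileSpecName` are hypothetical stand-ins, not pdfcpu types or functions, and this is not necessarily how the library should or will fix it.

```go
package main

import "fmt"

// attachment is a stand-in for the model.Attachment fields used here.
// It is not the pdfcpu type.
type attachment struct {
	ID       string
	FileName string
	Desc     string
}

// fileSpecName picks the value to store in /F and /UF:
// the file name when present, otherwise the ID.
// Hypothetical helper; not a pdfcpu function.
func fileSpecName(a attachment) string {
	if a.FileName != "" {
		return a.FileName
	}
	return a.ID
}

func main() {
	withName := attachment{ID: "id1", FileName: "a.txt", Desc: "a-decs"}
	withoutName := attachment{ID: "id2", Desc: "1-decs"}

	fmt.Println(fileSpecName(withName))    // a.txt
	fmt.Println(fileSpecName(withoutName)) // id2
}
```

The name tree key would stay a.ID in this sketch, so lookups by ID are unaffected.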

Horst Rutter · Answer 5 · Thu Mar 07 2024 23:23:21 GMT+0800 (China Standard Time)

Thanks, I'll take a look.