pdfcpu / pdfcpu

A PDF processor written in Go.

Home Page: http://pdfcpu.io/

Parsing document with large ObjectStreams uses lots of memory

fancycode opened this issue

Tested with the latest master (3282d8a) using the test script below:

package main

import (
	"log"
	"os"
	"runtime"

	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu/model"
)

func main() {
	log.SetFlags(log.Flags() | log.Lmicroseconds)
	fp, err := os.Open("prezz_2016.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer fp.Close()

	conf := model.NewDefaultConfiguration()
	log.Printf("Parsing ...")
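	// Snapshot memory statistics before and after parsing.
	var start, end runtime.MemStats
	runtime.ReadMemStats(&start)
	pdf, err := pdfcpu.Read(fp, conf)
	if err != nil {
		log.Fatal(err)
	}
	// Force a collection so HeapAlloc counts only objects that are still live.
	runtime.GC()
	runtime.ReadMemStats(&end)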
	log.Printf("Done, uses %d MiBytes heap memory, %d MiBytes system memory",
		(end.HeapAlloc-start.HeapAlloc)/(1024*1024),
		(end.HeapSys-start.HeapSys)/(1024*1024),
	)

	// Make sure pdf.PageCount is populated before printing it.
	if err := pdf.EnsurePageCount(); err != nil {
		log.Fatal(err)
	}

	log.Printf("Parsed %d pages", pdf.PageCount)
}
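
For anyone who wants to reproduce the measurement on other inputs, here is a minimal sketch of the same pattern as a reusable helper, using only the standard library. The name heapGrowthMiB is hypothetical and not part of pdfcpu. Unlike the script above, it also forces a GC before the baseline reading and uses signed arithmetic, so leftover garbage or a shrinking heap should not distort the number.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapGrowthMiB (hypothetical helper, not part of pdfcpu) runs fn and
// reports roughly how many MiB of live heap were added while fn's result
// is still reachable.
func heapGrowthMiB[T any](fn func() T) (result T, grewMiB int64) {
	var before, after runtime.MemStats

	// Collect first so garbage from earlier work is not part of the baseline.
	runtime.GC()
	runtime.ReadMemStats(&before)

	result = fn()

	// Collect again so temporaries created by fn are not counted;
	// result is returned below, so it stays live across this GC.
	runtime.GC()
	runtime.ReadMemStats(&after)

	// Signed arithmetic avoids uint64 wrap-around if the heap shrank.
	grewMiB = (int64(after.HeapAlloc) - int64(before.HeapAlloc)) / (1 << 20)
	return result, grewMiB
}

func main() {
	// Stand-in workload: retain a 64 MiB buffer instead of a parsed PDF.
	kept, grew := heapGrowthMiB(func() []byte {
		return make([]byte, 64<<20)
	})
	fmt.Printf("retained %d bytes, heap grew by about %d MiB\n", len(kept), grew)
}
```

In the test script, the closure would wrap the pdfcpu.Read call and return the parsed context, so the context stays reachable while the second reading is taken.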

Example file: prezz_2016.pdf

Original source:
http://www.sistemapiemonte.it/eXoRisorse/dwd/servizi/OperePubbliche/prezzario/prezz_2016.pdf

Output on my machine (Ubuntu 20.04, Go 1.22.1):

$ time go run test.go 
2024/03/20 14:14:37.969259 Parsing ...
2024/03/20 14:14:49.661571 Done, uses 4244 MiBytes heap memory, 6783 MiBytes system memory
2024/03/20 14:14:49.661610 Parsed 1133 pages

real	0m12,381s
user	0m19,722s
sys	0m2,471s
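
To put the numbers above in proportion, here is a small sketch that divides the reported heap figure by the reported page count. The helper name heapPerPageMiB is made up for illustration, and this is only a rough yardstick: memory use need not scale linearly with the number of pages.

```go
package main

import "fmt"

// heapPerPageMiB (hypothetical helper) divides a retained-heap figure in MiB
// by a page count. It returns 0 for a non-positive page count.
func heapPerPageMiB(heapMiB, pages int) float64 {
	if pages <= 0 {
		return 0
	}
	return float64(heapMiB) / float64(pages)
}

func main() {
	// Figures taken from the output above: 4244 MiB heap, 1133 pages.
	fmt.Printf("%.2f MiB of heap per page\n", heapPerPageMiB(4244, 1133))
}
```

With the figures from this run, that works out to roughly 3.7 MiB of retained heap per page.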