Parsing document with large ObjectStreams uses lots of memory
fancycode opened this issue
Joachim Bauch commented
Tested with the latest master (3282d8a) and the test script below:
package main

import (
	"log"
	"os"
	"runtime"

	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu/model"
)

func main() {
	log.SetFlags(log.Flags() | log.Lmicroseconds)

	fp, err := os.Open("prezz_2016.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer fp.Close()

	conf := model.NewDefaultConfiguration()
	log.Printf("Parsing ...")

	// Snapshot memory statistics before and after parsing.
	var start, end runtime.MemStats
	runtime.ReadMemStats(&start)
	pdf, err := pdfcpu.Read(fp, conf)
	// Collect garbage first so that "end" reflects retained memory only.
	runtime.GC()
	runtime.ReadMemStats(&end)
	if err != nil {
		log.Fatal(err)
	}

	log.Printf("Done, uses %d MiBytes heap memory, %d MiBytes system memory",
		(end.HeapAlloc-start.HeapAlloc)/(1024*1024),
		(end.HeapSys-start.HeapSys)/(1024*1024),
	)

	// Using pdf here also keeps the parsed context alive across the GC above.
	if err := pdf.EnsurePageCount(); err != nil {
		log.Fatal(err)
	}
	log.Printf("Parsed %d pages", pdf.PageCount)
}
Example file: prezz_2016.pdf
Original source: http://www.sistemapiemonte.it/eXoRisorse/dwd/servizi/OperePubbliche/prezzario/prezz_2016.pdf
Output on my machine (Ubuntu 20.04, Go 1.22.1):
$ time go run test.go
2024/03/20 14:14:37.969259 Parsing ...
2024/03/20 14:14:49.661571 Done, uses 4244 MiBytes heap memory, 6783 MiBytes system memory
2024/03/20 14:14:49.661610 Parsed 1133 pages
real 0m12,381s
user 0m19,722s
sys 0m2,471s