pdfcpu / pdfcpu

A PDF processor written in Go.

Home Page: http://pdfcpu.io/

extract image: index out of range when decoding DCT encoded image

peltevis opened this issue · comments

  • Version: 0.7.0 release; I don't see any newer commits addressing the issue
  • OS and OS version: macOS 14.3
  • Using the pdfcpu api

The issue occurs with PDFs that appear to have specific characteristics in their image encoding, not tied to a specific PDF writer.

Description

I encountered a runtime panic due to an "index out of range" error in the renderDCTEncodedImage function when trying to extract images from certain PDF files using pdfcpu. The panic suggests an access attempt outside the bounds of an array.

After debugging the code for a bit, I realized the problem was that we were attempting to decode the image as if it had the YCbCr color space. In reality, the image had an Indexed(DeviceRGB) color space, but we failed to recognize this: when casting the ColorSpace dict entry, we didn't type-check for Array (the entry was an array, as shown below). As a result, we continued with the normal flow and attempted to decode it as a YCbCr image.

[Screenshot, 2024-03-12 20:12:13: the ColorSpace entry shown as an array]
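The distinction described above, a ColorSpace entry that is either a bare name or an array whose first element names the family, can be sketched with a Go type switch. This is only an illustration: `Name` and `Array` here are simplified stand-ins for pdfcpu's object types, and `colorSpaceFamily` is a hypothetical helper, not a function from the library.

```go
package main

import "fmt"

// Name and Array are simplified stand-ins (assumptions) for pdfcpu's PDF
// object types, defined here only so the sketch compiles on its own.
type Name string
type Array []interface{}

// colorSpaceFamily is a hypothetical helper. It returns the colorspace
// family for a dereferenced ColorSpace entry, which may be a bare name
// (e.g. DeviceRGB) or an array whose first element is the family name
// (e.g. [Indexed DeviceRGB 255 <lookup>]).
func colorSpaceFamily(o interface{}) (string, error) {
	switch cs := o.(type) {
	case Name:
		return string(cs), nil
	case Array:
		// Check the length before indexing to avoid a panic on an empty array.
		if len(cs) == 0 {
			return "", fmt.Errorf("empty colorspace array")
		}
		n, ok := cs[0].(Name)
		if !ok {
			return "", fmt.Errorf("colorspace array does not start with a name: %T", cs[0])
		}
		return string(n), nil
	default:
		return "", fmt.Errorf("unexpected colorspace object type %T", o)
	}
}

func main() {
	direct := Name("DeviceRGB")
	indexed := Array{Name("Indexed"), Name("DeviceRGB"), 255, "lookup"}
	for _, o := range []interface{}{direct, indexed} {
		family, err := colorSpaceFamily(o)
		fmt.Println(family, err)
	}
}
```

Handling only the `Name` case silently treats every array-valued entry as "no special colorspace", which is how an Indexed image can end up on the YCbCr path.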

Proposed Solution

I've modified the renderDCTEncodedImage function, based on what the library does in renderFlateEncodedImage, to account for the ColorSpace entry being an Array. I've included all the other ColorSpace names, but in reality I only need IndexedCS for my use case.

func renderDCTEncodedImage(xRefTable *XRefTable, sd *StreamDict, thumb bool, resourceName string, objNr int) (io.Reader, string, error) {

	im, err := pdfImage(xRefTable, sd, thumb, objNr)
	if err != nil {
		return nil, "", err
	}

	o, err := xRefTable.DereferenceDictEntry(sd.Dict, "ColorSpace")
	if err != nil {
		return nil, "", err
	}

	switch cs := o.(type) {

	case Name:
		switch cs {

		case DeviceCMYKCS:
			return renderDeviceCMYKToPng(im, resourceName)

		// case DeviceRGBCS:
		// 	fmt.Println("DeviceRGBCS")
		// 	return renderDeviceRGBToPNG(im, resourceName)

		default:
			//fmt.Printf("renderDCTEncodedImage: objNr=%d, colorspace: %s\n", objNr, cs.String())
		}

	case Array:
		// Guard against an empty array before reading the colorspace name.
		if len(cs) == 0 {
			break
		}
		csn, _ := cs[0].(Name)

		switch csn {

		case CalRGBCS:
			return renderCalRGBToPNG(im, resourceName)

		case ICCBasedCS:
			return renderICCBased(xRefTable, im, resourceName, cs)

		case IndexedCS:
			return renderIndexed(xRefTable, im, resourceName, cs)

		default:
			log.Info.Printf("renderDCTEncodedImage: objNr=%d, unsupported array colorspace %s\n", objNr, csn)
		}
	}

	bb := bytes.NewReader(im.sd.Content)
	dec := gob.NewDecoder(bb)

	var img image.YCbCr
	if err := dec.Decode(&img); err != nil {
		return nil, "", err
	}

	var buf bytes.Buffer
	if err := png.Encode(&buf, &img); err != nil {
		return nil, "", err
	}

	return &buf, "png", nil
	//return &Image{&buf, 0, resourceName, im.thumb, "png"}, nil
}
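For context on the IndexedCS branch, here is a minimal sketch of what rendering an Indexed(DeviceRGB) image involves: each sample is an index into a lookup table of consecutive R, G, B bytes. It assumes 8 bits per sample and a DeviceRGB base; `expandIndexed` is a hypothetical helper written for illustration and is not pdfcpu's renderIndexed.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
)

// expandIndexed is a hypothetical helper that converts 8-bit palette
// indices into an NRGBA image. lookup holds the palette as consecutive
// R, G, B bytes (the DeviceRGB base is an assumption of this sketch).
func expandIndexed(samples, lookup []byte, w, h int) (*image.NRGBA, error) {
	if len(samples) < w*h {
		return nil, fmt.Errorf("need %d samples, got %d", w*h, len(samples))
	}
	img := image.NewNRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			i := int(samples[y*w+x]) * 3
			// Bounds-check the palette access instead of letting it panic.
			if i+2 >= len(lookup) {
				return nil, fmt.Errorf("palette index %d out of range", samples[y*w+x])
			}
			img.SetNRGBA(x, y, color.NRGBA{R: lookup[i], G: lookup[i+1], B: lookup[i+2], A: 255})
		}
	}
	return img, nil
}

func main() {
	// Two-entry palette: red, then blue. The 2x1 image uses both entries.
	lookup := []byte{255, 0, 0, 0, 0, 255}
	img, err := expandIndexed([]byte{0, 1}, lookup, 2, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(img.NRGBAAt(0, 0), img.NRGBAAt(1, 0))
}
```

Returning an error for an out-of-range index keeps a malformed palette from turning into a runtime panic.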

I can't provide the PDF that causes the panic as it has some sensitive information.

I just realized I was using a really old pdfcpu version; I thought it was 0.7.0. My bad!