unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)

Home Page:https://unidoc.io

[BUG] ExtractText triggers a panic: runtime error: index out of range [0] with length 0

becoded opened this issue

Description

We call ExtractText() and, for certain files, it intermittently panics with an index out of range error.

Stacktrace:

panic: runtime error: index out of range [0] with length 0 [recovered]
	panic: runtime error: index out of range [0] with length 0

goroutine 21 [running]:
testing.tRunner.func1.2({0x1009ba340, 0x140001f5d28})
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1389 +0x1c8
testing.tRunner.func1()
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1392 +0x384
panic({0x1009ba340, 0x140001f5d28})
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/runtime/panic.go:838 +0x204
github.com/unidoc/unipdf/v3/internal/textencoding.CMapEncoder.CharcodeToRune(...)
	/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/internal/textencoding/textencoding.go:552
github.com/unidoc/unipdf/v3/extractor.(*textObject).renderText(0x14000ab02c0, {0x14000759328, 0x1, 0x8})
	/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:762 +0xab0
github.com/unidoc/unipdf/v3/extractor.(*textObject).showTextAdjusted(0x14000ab02c0, 0x1400000fea8)
	/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:132 +0x178
github.com/unidoc/unipdf/v3/extractor.(*Extractor).extractPageText.func1(0x1400034fdd0, {{0x1009f2d78, 0x100f63dc8}, {0x1009f2e80, 0x14000084360}, {0x1009801a0, 0x140006021c8}, {0x10099ad00, 0x140001f5cf8}, {0x3ff0000000000000, ...}}, ...)
	/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:797 +0x2348
github.com/unidoc/unipdf/v3/contentstream.(*ContentStreamProcessor).Process(0x14000765aa0, 0x100f63dc8?)
	/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/contentstream/contentstream.go:314 +0xa94
github.com/unidoc/unipdf/v3/extractor.(*Extractor).extractPageText(0x14000136060, {0x14000644000, 0x9a44e}, 0x14000418060?, {0x3ff0000000000000, 0x0, 0x0, 0x0, 0x3ff0000000000000, 0x0, ...}, ...)
	/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:828 +0x754
github.com/unidoc/unipdf/v3/extractor.(*Extractor).ExtractPageText(0x14000136060)
	/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:243 +0x74
github.com/unidoc/unipdf/v3/extractor.(*Extractor).ExtractTextWithStats(0x14000214380?)
	/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:508 +0x20
github.com/unidoc/unipdf/v3/extractor.(*Extractor).ExtractText(...)
	/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:526

The current (obfuscated) code of CMapEncoder.CharcodeToRune looks like this:

func (_agg CMapEncoder) CharcodeToRune(code CharCode) (rune, bool) {
	_egf, _ceg := _agg.charcodeToString(code)
	return ([]rune(_egf))[0], _ceg
}

The panic occurs because, for these files, charcodeToString sometimes returns an empty string. []rune("") has length 0, so indexing element 0 panics.

So a potential fix would be:

func (_agg CMapEncoder) CharcodeToRune(code CharCode) (rune, bool) {
	_egf, _ceg := _agg.charcodeToString(code)

	if _egf == "" {
		return MissingCodeRune, false
	}

	return ([]rune(_egf))[0], _ceg
}
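
To try the guard outside the library, here is a minimal standalone sketch. It does not use unipdf: charcodeToString is a hypothetical stand-in backed by a plain map, and missingCodeRune is assumed to be U+FFFD for illustration only.

```go
package main

import "fmt"

// missingCodeRune stands in for the library's MissingCodeRune.
// Assumption: the value U+FFFD is chosen here only for illustration.
const missingCodeRune = '\ufffd'

// charcodeToString is a hypothetical stand-in for the CMap lookup.
// It can return an empty string, either because the code is absent
// or because the table maps the code to "".
func charcodeToString(table map[uint32]string, code uint32) (string, bool) {
	s, ok := table[code]
	return s, ok
}

// charcodeToRune applies the proposed guard: an empty string is treated
// as a missing code instead of being indexed.
func charcodeToRune(table map[uint32]string, code uint32) (rune, bool) {
	s, ok := charcodeToString(table, code)
	if s == "" {
		return missingCodeRune, false
	}
	return []rune(s)[0], ok
}

func main() {
	// 0x42 maps to an empty string, 0x43 is not in the table at all.
	table := map[uint32]string{0x41: "A", 0x42: ""}
	for _, code := range []uint32{0x41, 0x42, 0x43} {
		r, ok := charcodeToRune(table, code)
		fmt.Printf("code %#x -> %q (ok=%v)\n", code, r, ok)
	}
}
```

Without the empty-string check, the calls for 0x42 and 0x43 would hit the same index out of range panic as in the stack trace.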

Expected Behavior

No panics when extracting text

Actual Behavior

Triggers a panic: runtime error: index out of range [0] with length 0 in certain cases
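
Until a fixed release is in use, a caller can contain the panic with recover. This is a minimal sketch, not part of unipdf: safeExtract is a hypothetical helper that wraps any extraction function, and in real code the closure would call ExtractText on the extractor.

```go
package main

import "fmt"

// safeExtract is a hypothetical helper: it runs extract and converts
// a panic into an ordinary error so one bad file does not crash the process.
func safeExtract(extract func() (string, error)) (text string, err error) {
	defer func() {
		if r := recover(); r != nil {
			text = ""
			err = fmt.Errorf("text extraction panicked: %v", r)
		}
	}()
	return extract()
}

func main() {
	// The closure simulates the failure by indexing an empty slice.
	_, err := safeExtract(func() (string, error) {
		var runes []rune
		return string(runes[0]), nil
	})
	fmt.Println(err)
}
```

This only hides the symptom; the text of the affected page is lost, so the guard inside CharcodeToRune is still the proper fix.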

Attachments

Unfortunately, I can't share a sample file for GDPR reasons.

Hi @becoded,

Thank you for reporting this issue and suggesting a potential fix.
We have released a new version, v3.35.0: https://github.com/unidoc/unipdf-src/releases/tag/v3.35.0