[BUG] ExtractText triggers a panic: runtime error: index out of range [0] with length 0
becoded opened this issue · comments
Description
We are using ExtractText() and, from time to time, we get an index out of range error.
Stacktrace:
panic: runtime error: index out of range [0] with length 0 [recovered]
panic: runtime error: index out of range [0] with length 0
goroutine 21 [running]:
testing.tRunner.func1.2({0x1009ba340, 0x140001f5d28})
/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1389 +0x1c8
testing.tRunner.func1()
/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1392 +0x384
panic({0x1009ba340, 0x140001f5d28})
/opt/homebrew/Cellar/go/1.18.1/libexec/src/runtime/panic.go:838 +0x204
github.com/unidoc/unipdf/v3/internal/textencoding.CMapEncoder.CharcodeToRune(...)
/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/internal/textencoding/textencoding.go:552
github.com/unidoc/unipdf/v3/extractor.(*textObject).renderText(0x14000ab02c0, {0x14000759328, 0x1, 0x8})
/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:762 +0xab0
github.com/unidoc/unipdf/v3/extractor.(*textObject).showTextAdjusted(0x14000ab02c0, 0x1400000fea8)
/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:132 +0x178
github.com/unidoc/unipdf/v3/extractor.(*Extractor).extractPageText.func1(0x1400034fdd0, {{0x1009f2d78, 0x100f63dc8}, {0x1009f2e80, 0x14000084360}, {0x1009801a0, 0x140006021c8}, {0x10099ad00, 0x140001f5cf8}, {0x3ff0000000000000, ...}}, ...)
/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:797 +0x2348
github.com/unidoc/unipdf/v3/contentstream.(*ContentStreamProcessor).Process(0x14000765aa0, 0x100f63dc8?)
/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/contentstream/contentstream.go:314 +0xa94
github.com/unidoc/unipdf/v3/extractor.(*Extractor).extractPageText(0x14000136060, {0x14000644000, 0x9a44e}, 0x14000418060?, {0x3ff0000000000000, 0x0, 0x0, 0x0, 0x3ff0000000000000, 0x0, ...}, ...)
/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:828 +0x754
github.com/unidoc/unipdf/v3/extractor.(*Extractor).ExtractPageText(0x14000136060)
/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:243 +0x74
github.com/unidoc/unipdf/v3/extractor.(*Extractor).ExtractTextWithStats(0x14000214380?)
/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:508 +0x20
github.com/unidoc/unipdf/v3/extractor.(*Extractor).ExtractText(...)
/Projects/go/workspace/pkg/mod/github.com/unidoc/unipdf/v3@v3.34.0/extractor/extractor.go:526
Currently, the obfuscated code of CMapEncoder.CharcodeToRune looks like:
func (_agg CMapEncoder) CharcodeToRune(code CharCode) (rune, bool) {
	_egf, _ceg := _agg.charcodeToString(code)
	return ([]rune(_egf))[0], _ceg
}
The error happens because, for these files, charcodeToString sometimes returns an empty string. []rune("") is an empty slice (length 0), so indexing it with [0] panics.
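The failure mode can be reproduced without unipdf. The sketch below is only an illustration: firstRuneUnchecked and tryFirstRune are hypothetical helpers written for this example, not functions from the library. It shows that converting an empty string to []rune gives a slice of length 0, and that indexing it raises a runtime panic.

```go
package main

import "fmt"

// firstRuneUnchecked mimics the failing pattern: it indexes the rune slice
// without checking its length. Hypothetical helper, not part of unipdf.
func firstRuneUnchecked(s string) rune {
	return []rune(s)[0]
}

// tryFirstRune calls firstRuneUnchecked and turns a panic into an error,
// so the demonstration can run to completion.
func tryFirstRune(s string) (r rune, err error) {
	defer func() {
		if rec := recover(); rec != nil {
			err = fmt.Errorf("recovered: %v", rec)
		}
	}()
	return firstRuneUnchecked(s), nil
}

func main() {
	// An empty string converts to a rune slice with no elements.
	fmt.Println("len:", len([]rune("")))

	// Indexing element 0 of that slice panics; the error carries the message.
	_, err := tryFirstRune("")
	fmt.Println("empty string:", err)

	// A non-empty string works as expected.
	r, err := tryFirstRune("A")
	fmt.Printf("non-empty string: %q, err=%v\n", r, err)
}
```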
So a potential fix would be:
func (_agg CMapEncoder) CharcodeToRune(code CharCode) (rune, bool) {
	_egf, _ceg := _agg.charcodeToString(code)
	if _egf == "" {
		return MissingCodeRune, false
	}
	return ([]rune(_egf))[0], _ceg
}
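To check the proposed guard in isolation, here is a minimal self-contained sketch. Everything in it is a stand-in: the cmapEncoder type, its map-backed charcodeToString, the CharCode type and the value chosen for MissingCodeRune are assumptions made for this example and do not reproduce unipdf's internal textencoding package. It only demonstrates that the empty-string check avoids the panic for a code that maps to "" and for a code that is not mapped at all.

```go
package main

import "fmt"

// CharCode and MissingCodeRune are stand-ins for the identifiers used in the
// issue; their real definitions in unipdf may differ.
type CharCode uint32

const MissingCodeRune = '\ufffd'

// cmapEncoder is a hypothetical encoder backed by a plain map.
type cmapEncoder struct {
	table map[CharCode]string
}

// charcodeToString returns the mapped string and whether the code was found.
// A code may be present yet map to an empty string, as described in the issue.
func (e cmapEncoder) charcodeToString(code CharCode) (string, bool) {
	s, ok := e.table[code]
	return s, ok
}

// CharcodeToRune applies the guard proposed in the issue: an empty string
// yields MissingCodeRune and false instead of indexing an empty slice.
func (e cmapEncoder) CharcodeToRune(code CharCode) (rune, bool) {
	s, ok := e.charcodeToString(code)
	if s == "" {
		return MissingCodeRune, false
	}
	return []rune(s)[0], ok
}

func main() {
	e := cmapEncoder{table: map[CharCode]string{1: "A", 2: ""}}
	// Code 1 is mapped, code 2 maps to an empty string, code 3 is unmapped.
	for _, code := range []CharCode{1, 2, 3} {
		r, ok := e.CharcodeToRune(code)
		fmt.Printf("code %d -> %q (ok=%v)\n", code, r, ok)
	}
}
```

Note that the guard reports false even when charcodeToString itself reported success with an empty string, so callers treat that case the same as a missing code.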
Expected Behavior
No panics when extracting text
Actual Behavior
Triggers a panic: runtime error: index out of range [0] with length 0 in certain cases
Attachments
Sadly enough, I can't share a file due to GDPR reasons.
Hi @becoded,
Thank you for reporting this issue and the potential fix.
We released a new version, v3.35.0: https://github.com/unidoc/unipdf-src/releases/tag/v3.35.0