Leptonica 1.83.0 breaks tesseract, which in turn breaks pdfsandwich
swsch opened this issue · comments
Greetings.
After updating a Gentoo box to leptonica 1.83.0, pdfsandwich stopped working. Some experimenting let me trace the problem to leptonica, as you can see in the bug report I filed as #891833 in Gentoo's Bugzilla.
In short: the same install of pdfsandwich and tesseract fails with leptonica 1.83.0 while it works with 1.82.0.
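For anyone who wants to guard against this in a build or startup check, here is a minimal sketch in C that flags the affected release from a version string of the form printed by `tesseract --version` (for example `leptonica-1.83.0`). The function name `lept_version_is_affected` is hypothetical and not part of leptonica. Treating exactly 1.83.0 as affected is an assumption based only on the two versions compared in this report.

```c
#include <stdio.h>

/* Hypothetical helper, not a leptonica API.
 * Returns 1 if the string names leptonica 1.83.0 (the release reported
 * as failing here), 0 for any other parseable version, and -1 if the
 * string cannot be parsed.
 * Accepts "leptonica-1.83.0" as well as a bare "1.83.0". */
int lept_version_is_affected(const char *version)
{
    int major = 0, minor = 0, patch = 0;

    if (!version)
        return -1;

    /* Try the prefixed form first, then the bare form. */
    if (sscanf(version, "leptonica-%d.%d.%d", &major, &minor, &patch) != 3 &&
        sscanf(version, "%d.%d.%d", &major, &minor, &patch) != 3)
        return -1;

    return (major == 1 && minor == 83 && patch == 0) ? 1 : 0;
}
```

The string could come from parsing the `tesseract --version` output, or from leptonica's `getLeptonicaVersion()` if the program links against the library directly.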
The relevant parts of pdfsandwich's verbose output:
# pdfsandwich -lang deu -gray -verbose -o 'test.pdf' 20230116_095121_3.pdf
pdfsandwich version 0.1.7
Version: ImageMagick 7.1.0-48 Q16 x86_64 20449 https://imagemagick.org/
Compiler: gcc (12.2)
unpaper 7.0.0
tesseract 5.3.0
leptonica-1.83.0
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.3) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libopenjp2 2.5.0
Found OpenMP 201511
Found libarchive 3.6.1 zlib/1.2.13 liblzma/5.2.9 bz2lib/1.0.8
Found libcurl/7.87.0 OpenSSL/1.1.1s zlib/1.2.13 libidn2/2.3.4 nghttp2/1.51.0
GPL Ghostscript 10.00.0 (2022-09-21)
pdfinfo version 23.01.0
pdfunite version 23.01.0
...
Input file: "20230116_095121_3.pdf"
Output file: "test.pdf"
Number of pages in inputfile: 1
More threads than pages. Using 1 threads instead.
Processing page 1.
identify -format "%w\n%h\n" "/tmp/pdfsandwich_tmp7852e4/pdfsandwich_inputfileb08248.pdf[0]"
convert -units PixelsPerInch -colorspace gray -depth 8 -background white -flatten -alpha Off -density 300x300 "/tmp/pdfsandwich_tmp7852e4/pdfsandwich_inputfileb08248.pdf[0]" /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm
Processing sheet #1: /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm -> /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm
[pgm_pipe @ 0x55b31216f9c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55b31216f9c0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55b31216f9c0] Encoder did not produce proper pts, making some up.
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif /tmp/pdfsandwich_tmp7852e4/pdfsandwich60ad08 -l deu pdf
Error in l_generateCIDataForPdf: cid not made from file
Error during processing.
ERROR: Command "OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif /tmp/pdfsandwich_tmp7852e4/pdfsandwich60ad08 -l deu pdf " failed.
Terminating pdfsandwich. All temporary files are kept.
After replacing 1.83.0 with 1.82.0, the same file is handled as expected:
# pdfsandwich -lang deu -gray -verbose -o 'test.pdf' 20230116_095121_3.pdf
pdfsandwich version 0.1.7
...
tesseract 5.3.0
leptonica-1.82.0
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.3) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libopenjp2 2.5.0
...
Processing page 1.
identify -format "%w\n%h\n" "/tmp/pdfsandwich_tmp93c02c/pdfsandwich_inputfile09c03e.pdf[0]"
convert -units PixelsPerInch -colorspace gray -depth 8 -background white -flatten -alpha Off -density 300x300 "/tmp/pdfsandwich_tmp93c02c/pdfsandwich_inputfile09c03e.pdf[0]" /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm
Processing sheet #1: /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm -> /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm
[pgm_pipe @ 0x562b4dcf59c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x562b4dcf59c0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x562b4dcf59c0] Encoder did not produce proper pts, making some up.
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm /tmp/pdfsandwich_tmp93c02c/pdfsandwich038143.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp93c02c/pdfsandwich038143.tif /tmp/pdfsandwich_tmp93c02c/pdfsandwich91968e -l deu pdf
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmp93c02c/pdfsandwich42b301.pdf
OCR done. Writing "test.pdf"
mv "/tmp/pdfsandwich_tmp93c02c/pdfsandwich42b301.pdf" "test.pdf"
test.pdf generated.
Done.
I believe the problem is in pdfio2.c, lines 569-570.
    if (!cid)
        return ERROR_INT("cid not made from file", __func__, 1);
Please remove those two lines and see if the test succeeds.
Removing these lines allows processing of similar files without error, so the patch should be good.
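To illustrate how a check like this can produce the failure, here is a simplified model in C. It assumes the function first tries a direct, file-based encoding that only handles some formats, and otherwise falls back to encoding from the decoded pixels. That control flow is an assumption, and every name below (`Data`, `encode_directly`, `encode_from_pixels`, `generate`) is a hypothetical stand-in, not the code in pdfio2.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the compressed image data; not a leptonica type. */
typedef struct { const char *encoder; } Data;

static Data direct_data = { "direct" };
static Data pixel_data  = { "from-pixels" };

/* Direct path: assumed to handle only some formats (here just "jpg"). */
Data *encode_directly(const char *format)
{
    return strcmp(format, "jpg") == 0 ? &direct_data : NULL;
}

/* Fallback path: assumed to work for any image that can be decoded. */
Data *encode_from_pixels(void)
{
    return &pixel_data;
}

/* check_early = 1 models the two quoted lines; 0 models the patched code.
 * Returns 0 on success and stores the result in *out, 1 on failure. */
int generate(const char *format, int check_early, Data **out)
{
    Data *d = encode_directly(format);

    if (check_early && !d)
        return 1;               /* gives up before the fallback is tried */

    if (!d)
        d = encode_from_pixels();
    if (!d)
        return 1;

    *out = d;
    return 0;
}
```

In this model, `generate("tif", 1, &d)` fails while `generate("tif", 0, &d)` succeeds through the fallback, which matches the symptom reported for the TIFF input above. Formats served by the direct path are unaffected either way.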
Many thanks for the quick response.
Excellent. The fix is now in.
Will there be a point release including the patch? If not, I'll suggest adding the patch to the gentoo package, so that 1.83.0 will be working there, too.
It's a bit of work to make a patch release. I'll follow the advice of the tesseract maintainers, which is why I left this issue open for now.
Are you referring to a patch release 1.83.1? As the latest code is already prepared for 1.84.0, a patch release would need a branch 1.83 (I can add that if you want) and a list of patches which should be added.
Which commits after 1.83.0 should be included in the patch release, too? I'd suggest these commits:
Are there others?
That's a nice offer, Stefan.
I can also do it without a branch, modifying 1.84.0 --> 1.83.1 and including all existing commits.
Then wait a few days before changing 1.83.1 --> 1.84.0.
But on second thought, it might be easier for you. Those two commits are the only important ones.
See pull request #660 which adds the required changes for 1.83.1 to the new branch 1.83.
Many thanks, Stefan. Except for a patch on 1.81, this is the only patch that has been required for 5 years, since 1.75.
Closing this issue.