DanBloomberg / leptonica

Leptonica is an open source library containing software that is broadly useful for image processing and image analysis applications. The official github repository for Leptonica is: danbloomberg/leptonica. See leptonica.org for more documentation.

Leptonica 1.83.0 breaks tesseract, which in turn breaks pdfsandwich

swsch opened this issue

Greetings.

After updating a Gentoo box to leptonica 1.83.0, pdfsandwich stopped working. Some experimenting let me pin the problem on leptonica, as you can see in the bug report I filed as #891833 in Gentoo's Bugzilla.

In short: the same install of pdfsandwich and tesseract fails with leptonica 1.83.0 while it works with 1.82.0.

The relevant parts of pdfsandwich's verbose output:

# pdfsandwich -lang deu -gray -verbose -o 'test.pdf' 20230116_095121_3.pdf
pdfsandwich version 0.1.7
Version: ImageMagick 7.1.0-48 Q16 x86_64 20449 https://imagemagick.org/
Compiler: gcc (12.2)
unpaper 7.0.0
tesseract 5.3.0
 leptonica-1.83.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.3) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libopenjp2 2.5.0
 Found OpenMP 201511
 Found libarchive 3.6.1 zlib/1.2.13 liblzma/5.2.9 bz2lib/1.0.8
 Found libcurl/7.87.0 OpenSSL/1.1.1s zlib/1.2.13 libidn2/2.3.4 nghttp2/1.51.0
GPL Ghostscript 10.00.0 (2022-09-21)
pdfinfo version 23.01.0
pdfunite version 23.01.0
...
Input file: "20230116_095121_3.pdf"
Output file: "test.pdf"
Number of pages in inputfile: 1
More threads than pages. Using 1 threads instead.
Processing page 1.
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmp7852e4/pdfsandwich_inputfileb08248.pdf[0]"
convert -units PixelsPerInch  -colorspace gray -depth 8 -background white -flatten -alpha Off -density 300x300  "/tmp/pdfsandwich_tmp7852e4/pdfsandwich_inputfileb08248.pdf[0]" /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm
Processing sheet #1: /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm -> /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm
[pgm_pipe @ 0x55b31216f9c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55b31216f9c0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55b31216f9c0] Encoder did not produce proper pts, making some up.
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif /tmp/pdfsandwich_tmp7852e4/pdfsandwich60ad08  -l deu pdf

Error in l_generateCIDataForPdf: cid not made from file
Error during processing.
ERROR: Command "OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif /tmp/pdfsandwich_tmp7852e4/pdfsandwich60ad08  -l deu pdf " failed.
Terminating pdfsandwich. All temporary files are kept.

After replacing 1.83.0 with 1.82.0, the same file is handled as expected:

# pdfsandwich -lang deu -gray -verbose -o 'test.pdf' 20230116_095121_3.pdf
pdfsandwich version 0.1.7
...
tesseract 5.3.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.3) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libopenjp2 2.5.0
...
Processing page 1.
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmp93c02c/pdfsandwich_inputfile09c03e.pdf[0]"
convert -units PixelsPerInch  -colorspace gray -depth 8 -background white -flatten -alpha Off -density 300x300  "/tmp/pdfsandwich_tmp93c02c/pdfsandwich_inputfile09c03e.pdf[0]" /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm
Processing sheet #1: /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm -> /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm
[pgm_pipe @ 0x562b4dcf59c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x562b4dcf59c0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x562b4dcf59c0] Encoder did not produce proper pts, making some up.
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm /tmp/pdfsandwich_tmp93c02c/pdfsandwich038143.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp93c02c/pdfsandwich038143.tif /tmp/pdfsandwich_tmp93c02c/pdfsandwich91968e  -l deu pdf
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmp93c02c/pdfsandwich42b301.pdf

OCR done. Writing "test.pdf"
mv "/tmp/pdfsandwich_tmp93c02c/pdfsandwich42b301.pdf" "test.pdf"

test.pdf generated.

Done.
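The two runs above differ only in the leptonica line of the version banner. As a minimal sketch, a wrapper script or packaging check could flag the failing release from that line. This assumes the banner keeps the `leptonica-MAJOR.MINOR.PATCH` form shown in the logs; the helper names `parse_leptonica_banner` and `is_affected_release` are hypothetical and are not part of Leptonica, tesseract, or pdfsandwich.

```c
#include <stdio.h>

/* Hypothetical helper: parses a banner line such as " leptonica-1.83.0"
 * (leading whitespace allowed) into its numeric parts.
 * Returns 1 on success, 0 if the line does not match that form. */
static int parse_leptonica_banner(const char *line,
                                  int *major, int *minor, int *patch) {
    return sscanf(line, " leptonica-%d.%d.%d", major, minor, patch) == 3;
}

/* Returns 1 only for 1.83.0, the release reported as failing in this
 * thread. Any other version, or an unparseable line, returns 0. */
static int is_affected_release(const char *line) {
    int major, minor, patch;
    if (!parse_leptonica_banner(line, &major, &minor, &patch))
        return 0;
    return major == 1 && minor == 83 && patch == 0;
}
```

Calling `is_affected_release(" leptonica-1.83.0")` yields 1, while the 1.82.0 banner from the working run yields 0.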

I believe the problem is in pdfio2.c, lines 569-570.

        if (!cid)
            return ERROR_INT("cid not made from file", __func__, 1);

Please remove those two lines and see if the test succeeds.
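If removing the check is enough to fix the failure, the code after it presumably handles a NULL `cid` on its own. The following is a minimal sketch of that kind of control flow, not the actual contents of pdfio2.c. It assumes a format-specific fast path that returns NULL for formats it cannot embed directly, followed by a generic fallback. All names here (`FakeCid`, `fast_path`, `generic_path`, `generate`) are hypothetical stand-ins.

```c
#include <string.h>

/* Hypothetical stand-in for the compressed image data object. */
typedef struct { const char *how; } FakeCid;

static FakeCid from_file = { "copied from file" };
static FakeCid from_pix  = { "re-encoded from pixels" };

/* Fast path: only some formats can be embedded directly. Returning
 * NULL for any other format is a normal outcome, not a failure. */
static FakeCid *fast_path(const char *format) {
    if (strcmp(format, "jpeg") == 0 || strcmp(format, "png") == 0)
        return &from_file;
    return NULL;
}

/* Generic path: assumed to work for any decodable image. */
static FakeCid *generic_path(void) {
    return &from_pix;
}

/* Returns 0 on success, 1 on error. With early_check set, a NULL
 * result from the fast path is treated as fatal, so the fallback
 * below is never reached for formats the fast path skips. */
static int generate(const char *format, int early_check, FakeCid **pcid) {
    FakeCid *cid = fast_path(format);
    if (early_check && !cid)
        return 1;               /* analogous to the two quoted lines */
    if (!cid)
        cid = generic_path();   /* fallback for all other formats */
    *pcid = cid;
    return cid ? 0 : 1;
}
```

In this sketch, `generate("tiff", 1, &cid)` fails while `generate("tiff", 0, &cid)` succeeds through the fallback, which mirrors the symptom seen with the .tif input above.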

Removing these lines allows processing of similar files without error, so the patch should be good.

Many thanks for the quick response.

Excellent. The fix is now in.

Will there be a point release including the patch? If not, I'll suggest adding the patch to the gentoo package, so that 1.83.0 will be working there, too.

@stweil

It's a bit of work to make a patch release. I'll follow the advice of the tesseract maintainers, which is why I left this issue open for now.

Are you referring to a patch release 1.83.1? As the latest code is already prepared for 1.84.0, a patch release would need a branch 1.83 (I can add that if you want) and a list of patches which should be added.

Which commits after 1.83.0 should be included in the patch release, too? I'd suggest these commits:

Are there others?

That's a nice offer, Stefan.

I can also do it without a branch, modifying 1.84.0 --> 1.83.1 and including all existing commits.
Then wait a few days before changing 1.83.1 --> 1.84.0.

But on second thought, it might be easier for you. Those two commits are the only important ones.

See pull request #660 which adds the required changes for 1.83.1 to the new branch 1.83.

Much thanks, Stefan. Except for a patch on 1.81, this is the only patch that has been required for 5 years, since 1.75.

Closing this issue.