tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

Home Page: https://tesseract-ocr.github.io/

unicharset_extractor segfault

bertsky opened this issue · comments

Current Behavior

After I built from the current main with debug symbols (configure --disable-openmp --enable-debug --disable-shared CXXFLAGS="-g -O0 -fsanitize=address,undefined -fstack-protector-strong -ftrapv"), trying to use tesstrain immediately segfaults on the unicharset_extractor step (all-gt is 313k, norm_mode=2, nothing unusual):

    #0 0x7faa6352b17e in std::filesystem::__cxx11::path::compare(std::filesystem::__cxx11::path const&) const (/lib/x86_64-linux-gnu/libstdc++.so.6+0x19017e)
    #1 0x562c491ddc50 in std::filesystem::__cxx11::operator==(std::filesystem::__cxx11::path const&, std::filesystem::__cxx11::path const&) (/data/ocr-d/ocrd_all/venv38/bin/unicharset_extractor+0x2556c50)
    #2 0x562c491dc60d in Main /data/ocr-d/ocrd_all/tesseract/src/training/unicharset_extractor.cpp:74
    #3 0x562c491dd09d in main /data/ocr-d/ocrd_all/tesseract/src/training/unicharset_extractor.cpp:120
    #4 0x7faa625df6c9  (/lib/x86_64-linux-gnu/libc.so.6+0x276c9)
    #5 0x7faa625df784 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x27784)
    #6 0x562c491db8c0 in _start (/data/ocr-d/ocrd_all/venv38/bin/unicharset_extractor+0x25548c0)

I compiled with g++ 8.3.0.

Judging by the stack trace, there is some non-interoperability with the C++ path library here...

Expected Behavior

unicharset_extractor should exit normally, producing output.

Suggested Fix

No response

tesseract -v

tesseract 5.3.4
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX512BW
 Found AVX512F
 Found AVX512VNNI
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.3.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 liblz4/1.8.3 libzstd/1.3.8
 Found libcurl/7.64.0 NSS/3.42.1 zlib/1.2.11 libidn2/2.0.5 libpsl/0.20.2 (+libidn2/2.0.5) libssh2/1.11.0 nghttp2/1.59.0 librtmp/2.3

Operating System

Debian 11 Bullseye

Other Operating System

No response

uname -a

GNU/Linux x86_64

Compiler

g++ 8.3.0

CPU

Intel Xeon Gold

Virtualization / Containers

VMWare

Other Information

No response

Does the crash depend on the input data? I tried a build with g++ 12.2.0 and your configuration. Running unicharset_extractor --output_unicharset /tmp/unicode --norm_mode 2 all-gt works fine, no crash occurred.

Maybe older versions of g++ are simply broken with regard to sanitizers?

No – I just tried with all my previous trainings (without touching the existing files), it happens everywhere.

I also rebuilt completely – to no avail.

Will try with clang, and with an older version of Tesseract...

According to the libstdc++ manual, for GCC 8.x -lstdc++fs is needed if <filesystem> is included in the code.
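
For anyone who wants to check this outside of Tesseract, here is a minimal sketch of the kind of code that pulls in the library. The helper name is_box_file is hypothetical and is not taken from unicharset_extractor.cpp; it only mirrors the path comparison seen in the stack trace. With libstdc++ 8.x the assumption is that it has to be linked with -lstdc++fs, as the manual says.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper (not Tesseract code): returns true if the file name
// has the extension ".box". The comparison goes through
// std::filesystem::path::compare, the frame at the top of the stack trace.
//
// Build note (assumption based on the libstdc++ manual for GCC 8.x):
//   g++ -std=c++17 example.cpp -lstdc++fs
// Newer libstdc++ releases should not need the extra library.
bool is_box_file(const std::string &name) {
  return std::filesystem::path(name).extension() == ".box";
}
```

Calling is_box_file on a few file names from a small main, once with and once without -lstdc++fs, should be enough to see whether a given toolchain is affected.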

Wow, same happens with clang++ 11.0.1-2~deb10u1!

-lstdc++fs is needed

that's what I initially suspected here

With clang use -stdlib=libc++. Otherwise it will use libstdc++.

For gcc and configure I think you need to add:

LIBS="$(LIBS) -lstdc++fs"

Not sure if that is all you need to add.

-lstdc++fs is needed

that's what I initially suspected here

Wow, I did manage to fix this by recompiling with -lstdc++fs!!

For gcc and configure I think you need to add:

LIBS="$(LIBS) -lstdc++fs"

I used LDFLAGS=-lstdc++fs make ... (because I noticed that this variable is not used otherwise). Not sure where to put it correctly.

But mind that clang++ behaves just the same.

Wow, same happens with clang++ 11.0.1-2~deb10u1!
But mind that clang++ behaves just the same.

Wait, looking more closely, it seems that passing CXX=clang++-11 (as instructed by the git compilation guide) was not enough to make the build actually use that compiler. Will investigate...

Did you install clang and libc++?

Also read my comment about clang: #4200 (comment)

Did you install clang and libc++?

Also read my comment about clang: #4200 (comment)

I used the recipe from tessdoc, i.e. passing CXX=clang++ to configure, which has autoconf check whether it supports -std=c++17 or -std=c++20 – the latter being true, that's what gets passed to each compiler/linker call.

Anyway, so just to confirm: I just recompiled with clang++-11 again, and the segfault is back.

I then applied the same trick (make unicharset_extractor LDFLAGS=-lstdc++fs -W src/training/unicharset_extractor.cpp to force recompilation) and voila the segfault is gone.

To sum up: both clang++ and g++ do need -lstdc++fs.

Not sure where to put it correctly.

So basically we could either put it in the respective targets' rules in Makefile.am, like for example checkmk does, or in LDFLAGS in configure.ac, like Libreoffice online does or Mega does.

But since unicharset_extractor seems to be the only place in the whole source where std::filesystem::path is used, maybe we should only declare that locally in Makefile.am's unicharset_extractor_LDADD, right?

On the other hand, I wonder why that library gets used at all. Wouldn't tesseract::ReadFile be the best fit actually (and so no extra library would be needed)?

Ha! That's indeed what it used to be...
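
To illustrate the "no extra library" option: a file name suffix can be checked with std::string alone. This is only a sketch with a hypothetical helper name, has_suffix; it is not what Tesseract does or did. Its semantics also differ slightly from std::filesystem::path::extension(), which for example treats a file named just ".box" as having no extension.

```cpp
#include <string>

// Hypothetical helper (not Tesseract code): returns true if `name` ends
// with `suffix`. It uses only std::string, so nothing from <filesystem>
// (and no -lstdc++fs) is involved.
bool has_suffix(const std::string &name, const std::string &suffix) {
  return name.size() >= suffix.size() &&
         name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

A call like has_suffix(file_name, ".box") would then stand in for the path comparison, at the cost of handling corner cases by hand.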

@zdenop since this originates with you, could you please comment?

Your system has libstdc++ 8. This version does not have the filesystem header in it. This header is included in another external library. This is why you need the additional linker flag.

On Linux, Clang by default will use GNU libstdc++. In your case, it will use libstdc++ 8 which does not have the filesystem header...

In both cases, the source of the problem is not the compiler but the standard C++(17) library, which is not part of the compiler.

As I mentioned, there is a way to tell clang to use LLVM's libc++ instead of GNU's libstdc++.
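
To see which standard library a given compiler invocation actually picks up, a small check against the libraries' own version macros can help. This is a sketch; the function name standard_library_name is made up for illustration. _LIBCPP_VERSION is defined by LLVM's libc++, while _GLIBCXX_RELEASE and __GLIBCXX__ are defined by GNU libstdc++.

```cpp
#include <string>  // any standard header defines the library's version macros

// Hypothetical helper: reports which C++ standard library this translation
// unit was compiled against, independent of the compiler that was used.
std::string standard_library_name() {
#if defined(_LIBCPP_VERSION)
  return "libc++ " + std::to_string(_LIBCPP_VERSION);
#elif defined(_GLIBCXX_RELEASE)
  // Major release of libstdc++ (available since GCC 7).
  return "libstdc++ " + std::to_string(_GLIBCXX_RELEASE);
#elif defined(__GLIBCXX__)
  // Older libstdc++ releases only expose a release date.
  return "libstdc++ (dated " + std::to_string(__GLIBCXX__) + ")";
#else
  return "unknown";
#endif
}
```

Printing the result from a build with clang++ and another with g++ on the same machine would show whether both end up on the same libstdc++.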

As I mentioned, there is a way to tell clang to use llvm's libc++ instead of gnu libstdc++.

ok, but how does that work?

configure --disable-openmp --disable-shared CXX=clang++-11 LDFLAGS="-stdlib=libc++" CXXFLAGS="-g -O2 -fPIC"
checking whether the C++ compiler works... no
configure: error: C++ compiler cannot create executables

also, are you saying I must switch to clang because my libstdc++ is too old?

if that's correct, shouldn't Tesseract build be a little more tolerant/versatile?

also, are you saying I must switch to clang because my libstdc++ is too old?

I'm just saying: Don't blame clang for an issue in old libstdc++ which does not have the filesystem header in it.

Even if we fix the code / the documentation now, we can't keep support for old compiler and stdlib versions forever. We want to move forwards. For comparison, the latest PyTorch requires GCC 9 or a newer version. It seems TF also needs GCC 9+.

BTW, PyTorch is planning to drop support for pre-m1 Macs (Intel Macs) soon, and Microsoft will drop support for its own MSVC 2019 in the next version of Onnx-Runtime.

Moving forwards...

I'm just saying: Don't blame clang for an issue in old libstdc++ which does not have the filesystem header in it.

Understood.

Even if we fix the code / the documentation now, we can't keep support for old compilers versions forever.

The problem was just introduced recently. It's the only place where that library gets used so far. Even if kept, it can be fixed easily, as I have shown. It's just a matter of 1-2 lines of configuration.

Also, Tesseract is not Pytorch, or ONNX – it's neither an ML framework nor state of the art. In fact, it's a legacy codebase, but very stable and reusable. That's why it still gets used and integrated so much. So there's a special responsibility not to arbitrarily break things here.

@bertsky: As far as I remember, std::filesystem::path was the easiest cross-platform way to handle file extensions, so I just used it. The decision to use C++17 instead of C++14 was made 5 years ago, and it has been required for 3 years. But in the code we still try to avoid it. Why?

According to GA, none of the supported major platforms and compilers has a problem with it.

@zdenop

As far as I remember, std::filesystem::path was the easiest cross-platform way to handle file extensions, so I just used it.

Understood. So how about the proposed fixes (configure.ac or Makefile.am)?

The decision to use C++17 instead of C++14 was made 5 years ago, and it has been required for 3 years. But in the code we still try to avoid it. Why?

I wouldn't know.

According to GA, none of the supported major platforms and compilers has a problem with it.

None of the GHA workflows test this, hence they don't detect the segfault.

@bertsky, is there a good reason why a rather old compiler like g++ 8.3.0 with the related libstdc++ and sanitizer flags should be supported? If not, then this issue should be closed.

@stweil yes, there is a good reason – see above.

Also, the install instructions say "you need a C++17 compiler", not "you need a C++17 compiler, but not that one, and not that certain version of its libstdc++".

BTW, how do you know only g++ 8.3.0 and its libstdc++ are affected?

In the past, we mentioned the compilers we support and their minimum versions. I think we should re-add this info.

@stweil yes, there is a good reason – see above.

I don't see a good reason for using g++ 8.3.0 with sanitizer flags in your comments above. Which supported Linux distribution uses that compiler version?

To sum up: both clang++ and g++ do need -lstdc++fs.

That was not needed in my test with g++ 12.2.0.

@stweil

I don't see a good reason for using g++ 8.3.0 with sanitizer flags in your comments above.

Perhaps you overlooked it.

  1. don't break things lightly (there is practically no cost for fixing this)
  2. you don't even know what other versions are affected, since unicharset_extractor is not covered by any tests

Sorry, "it works for me" is not an answer I can accept. Why do you support multiple platforms in the first place?

Robert, please answer my questions. Which supported Linux distribution uses a compiler version which does not work? Without that information, reproducing the segfault is unnecessarily difficult, and then I'd suggest closing this issue.

The Linux distros that have GCC 8.x as their default compiler:

  • Debian 10 'buster' (oldoldstable). Note that the Debian project does not support buster anymore. A third party organization provides extended security support for buster.
  • RHEL 8 (and its clones). GCC 13 is also available (gcc-toolset in AppStream).

Personally, I don't think we should care about GCC 8 anymore.