bitcoin / bitcoin

Bitcoin Core integration/staging tree

Home Page:https://bitcoincore.org/en/download

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

CI failure: `ThreadSanitizer: data race /usr/lib/llvm-12/bin/../include/c++/v1/ios:523:12 in std::__1::ios_base::width() const`

ryanofsky opened this issue · comments

I'm seeing this race failure in two unrelated PRs during the validation_chainstate_tests/chainstate_update_tip case:

https://cirrus-ci.com/task/5125235994263552?logs=ci#L4140 from #22702
https://cirrus-ci.com/task/5269464468946944?logs=ci#L4142 from #21206

test/validation_chainstate_tests.cpp(83): Entering test case "chainstate_update_tip"

...

[Makefile:18480: test/validation_chainstate_tests.cpp.test] Error 1

...

WARNING: ThreadSanitizer: data race (pid=23076)
Read of size 8 at 0x7f23d669c6d0 by main thread:
#0 std::__1::ios_base::width() const /usr/lib/llvm-12/bin/../include/c++/v1/ios:523:12 (test_bitcoin+0x17826c)

...

  Location is global 'std::__1::cout' of size 160 at 0x7f23d669c6b0 (libc++.so.1+0x0000000cd6d0)

...

SUMMARY: ThreadSanitizer: data race /usr/lib/llvm-12/bin/../include/c++/v1/ios:523:12 in std::__1::ios_base::width() const

I'm not sure if this is a real bug or spurious, but I did work around a similar problem previously (in #15719) by adding a suppression:

# Disable spurious error about std::cout.
# data race /usr/lib/llvm-10/bin/../include/c++/v1/ios:522:12 in std::__1::ios_base::width() const
# Location is global 'std::__1::cout' of size 160 at 0x7f492785e270 (libc++.so.1+0x0000000c0290)
# https://travis-ci.org/github/bitcoin/bitcoin/jobs/695004018
#
# Uses of std::cout are guaranteed thread safe by the c++ standard
# https://stackoverflow.com/questions/50322790/thread-safety-and-piping-to-streams
race:std::__1::ios_base::width

Also the same bug was observed on the master branch recently:

Indeed, this is not an issue with #15719, but with master

Indeed, this is not an issue with #15719, but with master

Suggested suppression in #23370.

Looking at https://cirrus-ci.com/github/bitcoin/bitcoin/master, I guess this problem does not happen reliably on master, but is nondeterministic? IIRC, when I was seeing similar TSAN errors months ago in #15719, the problem seemed to be deterministic and would happen every time.

The chainstate_update_tip was added a month ago in 673a5bd from #21526 by @jamesob, so maybe James could look at stack traces and clarify whether it is expected that there might be multiple threads writing to cout at same time.

I think the TSAN suppression seems safe, but am not completely confident about it. And even if it is thread safe it still means multiple threads could be writing to stdout at the same time and producing garbled output.

Also, I'm not sure why this bug would start happening now if #21526 was merged a month ago, so wondering if something else might have changed recently? Or if this happens infrequently enough to not be noticed? Or if this was noticed but was just ignored a little while?

Would be nice if there was a way to easily grep cirrus logs and find all the instances of this.

As mentioned std::cout is supposed to be threadsafe: https://eel.is/c++draft/iostream.objects#overview-7

The problem is, it seems that io manipulators are not threadsafe: the stream members are not atomic, and there are no locks: https://github.com/llvm/llvm-project/blob/release/12.x/libcxx/include/ios#L372

But many methods call ios.width(0) so that std::setw() is only valid for one output: https://en.cppreference.com/w/cpp/io/ios_base/width

This is done here: https://github.com/llvm/llvm-project/blob/main/libcxx/include/locale#L1400 which is called whenever e.g. operator<<(char const*) is called.

So even when the standard says that concurrent access to cout isn't supposed to have any data races, this is one right there. Not sure what to do about that.

Interesting that no one else ran into this before us. Maybe a bug should be reported, since I couldn't find any https://bugs.llvm.org/buglist.cgi?quicksearch=race%20ios_base&list_id=226152

I guess #23370 resolved CI failures, but from Martin's comment #23366 (comment), it seems like the suppression isn't ideal because there is a real data race. I think good followups would be to:

  • Improve comment added in #23370, so instead of being a blind link, it has some explanation and at least mentions std::cout. A good template could be comment from 81be7ff5 with updated information from #23366 (comment)
  • Add a new lock around uses of std::cout to remove the TSAN suppression. Or replace cout calls triggering this with calls that go through the logging framework to remove the suppression. Doing either of these would also prevent interleaved output from multiple threads in addition to avoiding need for the suppression.
  • Maybe check upstream and see if library implementation is violating the standard. From Martin's comment, it does seem like there is a bug in either the library or the standard, though it's not clear what a good fix would be.

Could maybe mark this issue up for grabs if this makes sense and is of interest.

Or replace cout calls triggering this with calls that go through the logging framework to remove the suppression.

If the boost test logging has internal locking, that seems a fine alternative. Are you referring to BOOST_TEST_MESSAGE https://www.boost.org/doc/libs/1_77_0/libs/test/doc/html/boost_test/utf_reference/testout_reference/test_output_macro_message.html ?

Or replace cout calls triggering this with calls that go through the logging framework to remove the suppression.

If the boost test logging has internal locking, that seems a fine alternative. Are you referring to BOOST_TEST_MESSAGE https://www.boost.org/doc/libs/1_77_0/libs/test/doc/html/boost_test/utf_reference/testout_reference/test_output_macro_message.html ?

I was being vague because I didn't look into where these cout writes were coming from. But from the cirrus TSAN output above, one cout call is coming G_TEST_LOG_FUN and the other one is coming from a boost method boost::unit_test::unit_test_log_t::operator<<(boost::unit_test::lazy_ostream const&). We also do have a few scattered BOOST_TEST_MESSAGE calls which would be a third source of cout writes.

To be able to remove the suppression, all these these cout writes would need to be protect with a lock. We can lock G_TEST_LOG_FUN ourselves since it is our own code. It looks like we could synchronize the boost cout output by redirecting it with boost::unit_test::unit_test_log.set_stream( std::ostream& ); https://www.boost.org/doc/libs/1_77_0/libs/test/doc/html/boost_test/test_output/logging_api/log_ct_output_stream_redirection.html with a custom std::ostream subclass that locks a mutex and forwards to cout.

I'm not sure if this would be worth it, but it seems something like this would be needed to remove the suppression.

Someone (not me) created a bug for exactly this issue, with a nice short reproducer: https://bugs.llvm.org/show_bug.cgi?id=52509

In C++20 there will be std::osyncstream: https://en.cppreference.com/w/cpp/io/basic_osyncstream So I think it would be nice to have something with the same API that's eventually replaced with std::osyncstream once we switch to C++20

bugs.llvm was deleted. The bug is now here: llvm/llvm-project#51851

Maybe we can set https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio as a work-around to get thread-safety?

Edit: nvm, this should already be set by default?

To be able to remove the suppression, all these these cout writes would need to be protect with a lock. We can lock G_TEST_LOG_FUN ourselves since it is our own code. It looks like we could synchronize the boost cout output by redirecting it with boost::unit_test::unit_test_log.set_stream( std::ostream& ); https://www.boost.org/doc/libs/1_77_0/libs/test/doc/html/boost_test/test_output/logging_api/log_ct_output_stream_redirection.html with a custom std::ostream subclass that locks a mutex and forwards to cout.

I'm not sure if this would be worth it, but it seems something like this would be needed to remove the suppression.

In C++20 there will be std::osyncstream: https://en.cppreference.com/w/cpp/io/basic_osyncstream So I think it would be nice to have something with the same API that's eventually replaced with std::osyncstream once we switch to C++20

I guess std::osyncstream does not work out of the box with boost::unit_test::unit_test_log.set_stream, because boost would internally need to call emit?