openssl / openssl

TLS/SSL and crypto library

Home Page:https://www.openssl.org

Tsan data race between sa_doall and ossl_sa_set

rschu1ze opened this issue

We (ClickHouse, an open-source analytical database) recently migrated from boringssl to OpenSSL 3.2 (ClickHouse/ClickHouse#59870).

Many of our tests are executed with *sanitizer instrumentation (thread, memory, address). One test checks the MySQL connector of ClickHouse and it fails with a data race detected by thread sanitizer in OpenSSL.

Here is the downstream issue report: ClickHouse/ClickHouse#64239. Clicking the first link and "integration_run_parallel1_0.log" brings up the detailed report: https://s3.amazonaws.com/clickhouse-test-reports/64199/96ebaa17d33a059d8da6a48c2fffdd8161e83238/integration_tests__tsan__[4_6]//home/ubuntu/actions-runner/_work/_temp/test/output_dir/integration_run_parallel1_0.log I also included it below for reference.

We are using this exact OpenSSL branch: https://github.com/ClickHouse/openssl/tree/ClickHouse/openssl-3.2.1

The issue looks similar to #19326 and #21527 (but I am not really an OpenSSL expert).

E           Exception: Sanitizer assert found for instance ==================
E           WARNING: ThreadSanitizer: data race (pid=3978)
E             Read of size 8 at 0x72200015c5a0 by thread T688 (mutexes: write M0):
E               #0 sa_doall build_docker/./contrib/openssl/crypto/sparse_array.c:86:30 (clickhouse+0x200cc67b) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #1 ossl_sa_doall_arg build_docker/./contrib/openssl/crypto/sparse_array.c:148:9 (clickhouse+0x200cc67b)
E               #2 ossl_sa_ALGORITHM_doall_arg build_docker/./contrib/openssl/crypto/property/property.c:97:1 (clickhouse+0x20098171) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #3 ossl_method_store_do_all build_docker/./contrib/openssl/crypto/property/property.c:490:9 (clickhouse+0x20098171)
E               #4 evp_generic_do_all build_docker/./contrib/openssl/crypto/evp/evp_fetch.c:621:5 (clickhouse+0x20020d5c) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #5 EVP_KEYMGMT_do_all_provided build_docker/./contrib/openssl/crypto/evp/keymgmt_meth.c:298:5 (clickhouse+0x2002ce87) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #6 ossl_decoder_ctx_setup_for_pkey build_docker/./contrib/openssl/crypto/encode_decode/decoder_pkey.c:441:5 (clickhouse+0x1fff9905) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #7 OSSL_DECODER_CTX_new_for_pkey build_docker/./contrib/openssl/crypto/encode_decode/decoder_pkey.c:803:16 (clickhouse+0x1fff9905)
E               #8 x509_pubkey_ex_d2i_ex build_docker/./contrib/openssl/crypto/x509/x_pubkey.c:208:14 (clickhouse+0x2010f534) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #9 asn1_item_embed_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:262:20 (clickhouse+0x1ff6ad8d) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #10 asn1_template_noexp_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:682:15 (clickhouse+0x1ff6c971) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #11 asn1_template_ex_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:558:16 (clickhouse+0x1ff6b83d) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #12 asn1_item_embed_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:422:19 (clickhouse+0x1ff6b209) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #13 asn1_template_noexp_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:682:15 (clickhouse+0x1ff6c971) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #14 asn1_template_ex_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:558:16 (clickhouse+0x1ff6b83d) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #15 asn1_item_embed_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:422:19 (clickhouse+0x1ff6b209) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #16 asn1_item_ex_d2i_intern build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:118:10 (clickhouse+0x1ff6a9ab) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #17 ASN1_item_d2i_ex build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:144:9 (clickhouse+0x1ff6a9ab)
E               #18 ASN1_item_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:154:12 (clickhouse+0x1ff6a9ab)
E               #19 d2i_X509 build_docker/./contrib/openssl/crypto/x509/x_x509.c:138:1 (clickhouse+0x2010f670) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #20 tls_process_server_certificate build_docker/./contrib/openssl/ssl/statem/statem_clnt.c:2006:13 (clickhouse+0x1ff456b9) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #21 ossl_statem_client_process_message build_docker/./contrib/openssl/ssl/statem/statem_clnt.c:1100:16 (clickhouse+0x1ff4411f) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #22 read_state_machine build_docker/./contrib/openssl/ssl/statem/statem.c:684:19 (clickhouse+0x1ff3ff07) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #23 state_machine build_docker/./contrib/openssl/ssl/statem/statem.c:478:21 (clickhouse+0x1ff3ff07)
E               #24 ossl_statem_connect build_docker/./contrib/openssl/ssl/statem/statem.c:297:12 (clickhouse+0x1ff3f0ee) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #25 SSL_do_handshake build_docker/./contrib/openssl/ssl/ssl_lib.c:4746:19 (clickhouse+0x1fec6701) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #26 SSL_connect build_docker/./contrib/openssl/ssl/ssl_lib.c:2208:12 (clickhouse+0x1fec6813) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #27 ma_tls_connect build_docker/./contrib/mariadb-connector-c/libmariadb/secure/openssl.c:627:30 (clickhouse+0x1d7c95e4) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
[...]


E             Previous write of size 8 at 0x72200015c5a0 by thread T678 (mutexes: write M1, write M2, write M3):
E               #0 ossl_sa_set build_docker/./contrib/openssl/crypto/sparse_array.c:214:8 (clickhouse+0x200cca9d) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #1 ossl_sa_ALGORITHM_set build_docker/./contrib/openssl/crypto/property/property.c:97:1 (clickhouse+0x20097ce0) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #2 ossl_method_store_insert build_docker/./contrib/openssl/crypto/property/property.c:286:12 (clickhouse+0x20097ce0)
E               #3 ossl_method_store_add build_docker/./contrib/openssl/crypto/property/property.c:344:14 (clickhouse+0x20097ce0)
E               #4 put_evp_method_in_store build_docker/./contrib/openssl/crypto/evp/evp_fetch.c:191:12 (clickhouse+0x200212ab) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #5 ossl_method_construct_this build_docker/./contrib/openssl/crypto/core_fetch.c:123:5 (clickhouse+0x1fff8be4) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #6 algorithm_do_map build_docker/./contrib/openssl/crypto/core_algorithm.c:77:13 (clickhouse+0x1fff8648) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #7 algorithm_do_this build_docker/./contrib/openssl/crypto/core_algorithm.c:122:15 (clickhouse+0x1fff8648)
E               #8 ossl_provider_doall_activated build_docker/./contrib/openssl/crypto/provider_core.c:1483:14 (clickhouse+0x2009fa63) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #9 ossl_algorithm_do_all build_docker/./contrib/openssl/crypto/core_algorithm.c:162:9 (clickhouse+0x1fff843b) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #10 ossl_method_construct build_docker/./contrib/openssl/crypto/core_fetch.c:153:5 (clickhouse+0x1fff88ce) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #11 inner_evp_generic_fetch build_docker/./contrib/openssl/crypto/evp/evp_fetch.c:313:23 (clickhouse+0x2002035e) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #12 evp_generic_fetch build_docker/./contrib/openssl/crypto/evp/evp_fetch.c:378:14 (clickhouse+0x20020082) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #13 EVP_KDF_fetch build_docker/./contrib/openssl/crypto/evp/kdf_meth.c:162:12 (clickhouse+0x200293c7) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #14 tls13_generate_secret build_docker/./contrib/openssl/ssl/tls13_enc.c:181:11 (clickhouse+0x1fee3c67) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #15 ssl_gensecret build_docker/./contrib/openssl/ssl/s3_lib.c:4854:18 (clickhouse+0x1feb76c5) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #16 ssl_derive build_docker/./contrib/openssl/ssl/s3_lib.c:4907:14 (clickhouse+0x1feb76c5)
E               #17 tls_parse_stoc_key_share build_docker/./contrib/openssl/ssl/statem/extensions_clnt.c:1885:13 (clickhouse+0x1ff35a7f) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #18 tls_parse_extension build_docker/./contrib/openssl/ssl/statem/extensions.c:765:20 (clickhouse+0x1ff2de8e) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #19 tls_parse_all_extensions build_docker/./contrib/openssl/ssl/statem/extensions.c:799:14 (clickhouse+0x1ff2df88) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #20 tls_process_server_hello build_docker/./contrib/openssl/ssl/statem/statem_clnt.c:1744:10 (clickhouse+0x1ff45088) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #21 ossl_statem_client_process_message build_docker/./contrib/openssl/ssl/statem/statem_clnt.c:1094:16 (clickhouse+0x1ff4412c) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #22 read_state_machine build_docker/./contrib/openssl/ssl/statem/statem.c:684:19 (clickhouse+0x1ff3ff07) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #23 state_machine build_docker/./contrib/openssl/ssl/statem/statem.c:478:21 (clickhouse+0x1ff3ff07)
E               #24 ossl_statem_connect build_docker/./contrib/openssl/ssl/statem/statem.c:297:12 (clickhouse+0x1ff3f0ee) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #25 SSL_do_handshake build_docker/./contrib/openssl/ssl/ssl_lib.c:4746:19 (clickhouse+0x1fec6701) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #26 SSL_connect build_docker/./contrib/openssl/ssl/ssl_lib.c:2208:12 (clickhouse+0x1fec6813) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #27 ma_tls_connect build_docker/./contrib/mariadb-connector-c/libmariadb/secure/openssl.c:627:30 (clickhouse+0x1d7c95e4) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
[...]

This appears to be a bug in ossl_method_store_do_all

It seems I previously encountered this when developing #24344 (in particular see 5d492e0) - but since that PR is currently abandoned the bug I discovered along the way got forgotten about.

Probably some solution similar to the approach I came up with in that PR is the way ahead (but that PR was using an RCU lock which has not been adopted in master (yet)).

Something I've found very helpful in BoringSSL is to run with TSan in CI and then write tests that specifically exercise subtle APIs intended to be used across threads. We have a lot less shared mutable state than OpenSSL (less complex and more performant; see the various 3.x perf regressions), so there's less of this sort of thing in the first place, but I think that strategy would apply here too. It might help you all avoid these kinds of issues from happening in the first place.

We do run tsan in CI, and the threadstest is explicitly written to find these kinds of issues. But that test is focused on libcrypto. We should probably extend it to do some libssl testing.

Ah yeah, BoringSSL has some thread tests for TLS session resumption, which we have definitely found valuable. Although the race itself seems to be in libcrypto, so it seems there may be some TSan testing gaps in OpenSSL on the libcrypto side too.

Just to say this out loud, with the exception of ossl_free_leaves, the SA table is really pretty close to being able to be lock free. If ossl_sa_set were modified to use the new CRYPTO_atomic_store API when adding new leaves and when storing the values themselves, locking around the data structure could be eliminated. The atomic op may be a performance hit, but if we could remove the surrounding locks, we could claw some of that back, and it would resolve the tsan race above.
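To make the atomic-store idea concrete, here is a minimal sketch. It is not the OpenSSL sparse array: `tiny_sa`, `tiny_sa_set`, `tiny_sa_doall` and `tiny_sa_sum_ints` are hypothetical names, the table is a flat fixed-size array rather than a tree, and C11 `<stdatomic.h>` stands in for `CRYPTO_atomic_store`. It only shows the release/acquire pairing that lets a reader iterate while a writer publishes a new entry without a plain data race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TINY_SA_SLOTS 16 /* hypothetical size; the real sparse array is a tree */

/* Hypothetical stand-in for one level of a sparse array. */
typedef struct {
    _Atomic(void *) slot[TINY_SA_SLOTS];
} tiny_sa;

/*
 * Writer side: publish the pointer with release ordering, so everything
 * written into *value beforehand is visible to a reader that later
 * observes the pointer.
 */
void tiny_sa_set(tiny_sa *sa, size_t idx, void *value)
{
    assert(sa != NULL && idx < TINY_SA_SLOTS);
    atomic_store_explicit(&sa->slot[idx], value, memory_order_release);
}

/*
 * Reader side: acquire-load every slot and call fn on the non-NULL ones.
 * Returns the number of entries visited.
 */
size_t tiny_sa_doall(tiny_sa *sa, void (*fn)(size_t, void *, void *),
                     void *arg)
{
    size_t visited = 0;

    assert(sa != NULL && fn != NULL);
    for (size_t i = 0; i < TINY_SA_SLOTS; i++) {
        void *v = atomic_load_explicit(&sa->slot[i], memory_order_acquire);

        if (v != NULL) {
            fn(i, v, arg);
            visited++;
        }
    }
    return visited;
}

/* Example callback: add the int stored in each entry to a long total. */
void tiny_sa_sum_ints(size_t idx, void *value, void *arg)
{
    (void)idx;
    *(long *)arg += *(int *)value;
}
```

Note what this does not give you: a reader may see some of the entries added during its walk and miss others, and nothing here makes it safe to free an entry while a reader still holds it. Those two gaps are where the difficulty sits.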

Except I doubt this is really true in this case. ossl_method_store_do_all iterates over all the ALGORITHMs in the store. We really need a consistent set of ALGORITHMs for the entire operation and we don't want to have to handle changes to the sparse array half way through iterating over it.

This is where some of the discussion in the other bugs about unnecessary shared mutability comes in. A more straightforward way to design this would simply have been:

  1. At the time you load a provider, query all the algorithms and instantiate EVP_MDs, etc., for every one of them. Build an efficient index to map from algorithm names to those EVP_MDs and whatnot.
  2. After all that stuff has been instantiated, keep the entire provider object immutable. The EVP_MDs are fixed, the index is fixed, etc.
  3. Since OpenSSL decided to allow concurrent provider load and use (bad idea), you all are stuck paying for some serialization in the global provider list. However, now that the individual providers are immutable, this synchronization is limited to a single list of O(10) elements. Now techniques like RCU are viable.
  4. Although the provider list itself needs synchronization, your providers themselves are immutable and so they can be queried concurrently without fuss.
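A rough sketch of steps 1 and 2, under stated assumptions: `algo_entry`, `provider_index` and the three functions are hypothetical names, and `method` is just an opaque pointer standing in for a pre-built EVP_MD or similar. The index is built once at load time (copy, then sort) and only read afterwards, so concurrent lookups need no lock.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry: an algorithm name mapped to a pre-built method. */
typedef struct {
    const char *name;   /* e.g. "SHA2-256" */
    const void *method; /* opaque stand-in for an instantiated method */
} algo_entry;

/* Hypothetical per-provider index; never modified after build. */
typedef struct {
    algo_entry *entries;
    size_t count;
} provider_index;

static int cmp_entry(const void *a, const void *b)
{
    return strcmp(((const algo_entry *)a)->name,
                  ((const algo_entry *)b)->name);
}

/* Load time: copy the entries and sort them by name. Returns 1 on success. */
int provider_index_build(provider_index *idx, const algo_entry *algos,
                         size_t n)
{
    assert(idx != NULL && (algos != NULL || n == 0));
    idx->entries = malloc((n > 0 ? n : 1) * sizeof(*idx->entries));
    if (idx->entries == NULL)
        return 0;
    if (n > 0) {
        memcpy(idx->entries, algos, n * sizeof(*algos));
        qsort(idx->entries, n, sizeof(*idx->entries), cmp_entry);
    }
    idx->count = n;
    return 1;
}

/* Use time: read-only binary search, safe to call from many threads. */
const void *provider_index_find(const provider_index *idx, const char *name)
{
    algo_entry key = { name, NULL };
    const algo_entry *hit;

    assert(idx != NULL && name != NULL);
    if (idx->count == 0)
        return NULL;
    hit = bsearch(&key, idx->entries, idx->count, sizeof(key), cmp_entry);
    return hit != NULL ? hit->method : NULL;
}

/* Unload time: only once no thread can still be using the index. */
void provider_index_free(provider_index *idx)
{
    free(idx->entries);
    idx->entries = NULL;
    idx->count = 0;
}
```

The only synchronization left is around the list of providers (steps 3 and 4), not around each provider's contents.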

This is a pretty general lesson about threaded systems. Shared things should be immutable. Shared, mutable things are an endless source of complexity, synchronization problems, and thread contention. This is why I flagged issues like #23369 as they stand in the way of you all fixing this design problem.

Of course, the immediate issue is a threading problem and the immediate fix is that you all should lock the mutable state that you currently keep mutable. That will likely make things even slower, but the performance problems are just part of the OpenSSL 3.x architecture. To fix those, you have to fix the architecture.
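For illustration only, the "lock the mutable state" option looks roughly like this. It is a sketch with POSIX rwlocks and hypothetical names (`locked_store` and friends), not OpenSSL code; OpenSSL has its own lock wrappers, and this is not meant to describe any actual patch. Writers take the write lock, and an iteration holds the read lock for its whole duration, so it sees a consistent set of entries.

```c
#define _POSIX_C_SOURCE 200809L /* for pthread_rwlock_* under strict -std */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define STORE_SLOTS 16 /* hypothetical fixed size */

/* Hypothetical mutable store guarded by a single read/write lock. */
typedef struct {
    pthread_rwlock_t lock;
    void *slot[STORE_SLOTS];
} locked_store;

/* Returns 1 on success, 0 on failure. */
int locked_store_init(locked_store *s)
{
    assert(s != NULL);
    for (size_t i = 0; i < STORE_SLOTS; i++)
        s->slot[i] = NULL;
    return pthread_rwlock_init(&s->lock, NULL) == 0;
}

/* Writer: exclusive access while the slot is updated. */
int locked_store_set(locked_store *s, size_t idx, void *value)
{
    assert(s != NULL);
    if (idx >= STORE_SLOTS || pthread_rwlock_wrlock(&s->lock) != 0)
        return 0;
    s->slot[idx] = value;
    pthread_rwlock_unlock(&s->lock);
    return 1;
}

/*
 * Reader: the read lock is held across the whole walk. fn must not call
 * locked_store_set on the same store, or it will deadlock.
 * Returns the number of entries visited.
 */
size_t locked_store_doall(locked_store *s,
                          void (*fn)(size_t, void *, void *), void *arg)
{
    size_t visited = 0;

    assert(s != NULL && fn != NULL);
    if (pthread_rwlock_rdlock(&s->lock) != 0)
        return 0;
    for (size_t i = 0; i < STORE_SLOTS; i++) {
        if (s->slot[i] != NULL) {
            fn(i, s->slot[i], arg);
            visited++;
        }
    }
    pthread_rwlock_unlock(&s->lock);
    return visited;
}

/* Example callback: add the int stored in each entry to a long total. */
void locked_store_add_int(size_t idx, void *value, void *arg)
{
    (void)idx;
    *(long *)arg += *(int *)value;
}
```

The cost is the one described above: every iteration and every insert now contends on the same lock.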

> 3. Since OpenSSL decided to allow concurrent provider load and use (bad idea), you all are stuck paying for some serialization in the global provider list. However, now that the individual providers are immutable, this synchronization is limited to a single list of O(10) elements. Now techniques like RCU are viable.

Hmmm... thinking out loud - perhaps we could disallow concurrent provider load and use in a single library context, at least in 4.0. I have no idea how this could be a reasonable operation for any application anyway, as it entails randomly failing operations if a provider is not yet loaded, or randomly changing which provider performs the operation, etc.

The bigger problem will be the no-cache flag for queries if we want to instantiate all the provider operations on load - we would have to deem it unsupported/ignored basically.

Also I am not sure how expensive the initial load of providers like default or a general pkcs11 provider will be if there are hundreds of operations implemented - this could be actually prohibitively expensive for simple apps that use just a few operations.

The original design didn't call for any caching at all. I don't think we need to concern ourselves over maintaining the no-cache support, it would be nice to keep but not essential IMO.

I do agree that we shouldn't have made providers anything like as dynamic as they are.

> The bigger problem will be the no-cache flag for queries if we want to instantiate all the provider operations on load - we would have to deem it unsupported/ignored basically.

no-cache is a misfeature anyway IMO. Ignoring it seems ok to me.

no-cache is meant to be a space saving feature.

It's also great for testing.

Based on above discussion, is it right that no straightforward fix exists?

> Based on above discussion, is it right that no straightforward fix exists?

No, the fix would not be overly complicated. The discussion is only partially related.

It's not a great fix, but I think it's the best we can do right now without some significant refactoring:
#24782

@rschu1ze can you test with the attached draft PR, and confirm that the issue is resolved for you please?

I tried to reproduce the issue locally (it happens in one of our integration tests, specifically test_mysql_killed_while_insert_8_0), but I could not even get the test to run on my machine 😢. Since the issue happens only sporadically, my hopes of verifying the patch were low anyway.

In any case, I pushed your fix (thanks!) to our OpenSSL fork where it will be subject to our test suite (--> ClickHouse/ClickHouse#66064). We'll need to observe test_mysql_killed_while_insert_8_0 for a while to understand if the fix really helps. I can report back in one or two weeks if it is not urgent.