iotaledger / stronghold.rs

Stronghold is a secret management engine written in Rust.

Home Page: https://wiki.iota.org/stronghold.rs/getting_started

Multithreaded code that calls `Stronghold::load_client` panics in various ways

PhilippGackstatter opened this issue · comments

Bug description

Multi-threaded code that uses `Stronghold::load_client` panics in various ways, depending on some parameters.

Rust version

Which version of Rust are you running?

  • Rust version: 1.60.0

Stronghold version

Which version of Stronghold are you using?

Current dev-refactor branch, commit: 629466da.

Hardware specification

What hardware are you using?

  • Operating system: Ubuntu 20.04

If hardware details are important, I'm happy to provide them.

Steps to reproduce the bug


Run the following test multiple times. On my machine, the results were different across executions.

```rust
#[test]
fn test_stronghold_multi_threading() {
    let client_path = b"client_path".to_vec();
    let client_path2 = b"client_path2".to_vec();

    let stronghold1 = Stronghold::default();

    // Create both clients up front so they can be persisted and later reloaded.
    let _client = stronghold1.create_client(&client_path).unwrap();
    let _client2 = stronghold1.create_client(&client_path2).unwrap();

    stronghold1.write_client(&client_path).unwrap();
    stronghold1.write_client(&client_path2).unwrap();

    let stronghold2 = stronghold1.clone();

    // Two threads repeatedly reload their client and write to its store.
    let t1 = std::thread::spawn(move || {
        for _ in 0..20 {
            let cl = stronghold1.load_client(&client_path).unwrap();
            cl.store()
                .insert(b"test".to_vec(), b"value".to_vec(), None)
                .unwrap();
        }
    });

    let t2 = std::thread::spawn(move || {
        for _ in 0..20 {
            let cl = stronghold2.load_client(&client_path2).unwrap();
            cl.store()
                .insert(b"test".to_vec(), b"value".to_vec(), None)
                .unwrap();
        }
    });

    t1.join().unwrap();
    t2.join().unwrap();
}
```

Expected behaviour

The test always finishes successfully, i.e. without panics.

Actual behaviour

I observed 5 different results:

  1. Test binary finishes successfully (exit code 0)
  2. thread '' panicked at 'called Result::unwrap() on an Err value: LockAcquireFailed', client/src/tests/interface_tests.rs:162:60
  3. thread '' panicked at 'Releases exceeded retains', engine/runtime/src/boxed.rs:188:9
  4. Caused by: process didn't exit successfully: ~/git/stronghold.rs/target/debug/deps/iota_stronghold-18f1a25c50491033 test_stronghold_multi_threading --nocapture (signal: 11, SIGSEGV: invalid memory reference)
  5. fish: Job 1, './target/debug/deps/iota_strong…' terminated by signal SIGSEGV (Address boundary error)

Regarding case 2: The current Stronghold code uses `try_lock` everywhere, which means that if a lock cannot be acquired immediately, an error is returned to the user. In my opinion, this should be changed to a blocking `lock` call, i.e. the call should block until the lock can be acquired. It should not be the Stronghold user's responsibility to implement retry behaviour. This error was the original motivation for writing the above test, which then turned out to trigger the other failure modes as well.

The issue seems to stem from the `Stronghold::load_client` invocations, and in turn from the `Snapshot::get_state` call, which calls `KeyStore::get_key`. I suspect the problem originates somewhere in that area, though I did not track it down any further. The call to the store can be replaced with a procedure execution and results in the same behaviour, so that specific line is most likely not the cause of the errors.

Hey @PhilippGackstatter thanks for reporting this bug.

As we discussed earlier, using mutexes/locks was intended to be only a temporary solution. `try_lock()` was used as a mechanism to always fail fast with an error, instead of running into deadlocks with `lock()`.

thread '' panicked at 'Releases exceeded retains', engine/runtime/src/boxed.rs:188:9
This is a race condition.

With Rust >= 1.61.x, the currently employed `std::sync::Mutex` is denied by clippy, because the `MutexGuard` is held across an await point (the `await_holding_lock` lint).

Solving this bug would require either a proof of the absence of deadlocks, or finalizing the STM-based approach.

I can confirm that the test I attached in this issue runs (with minimal changes due to the changed interface, i.e. replacing `load_client` with `get_client`), and the example that we deactivated because of this issue also runs. Thank you!