Decryption support

furuholm opened this issue 5 years ago

I have been working on adding support for decrypting messages and if it is ok I would like to submit a PR with this functionality once I am done.

I have an initial version that almost works at https://github.com/furuholm/libsignal-protocol-rs/tree/decryption. However, it turns out that the decryption code in the underlying C library requires the mutex implementation to be reentrant, which parking_lot::RawMutex is not. When I replace RawMutex with ReentrantMutex, I get the following error message:

```text
error[E0277]: the type `std::cell::UnsafeCell<usize>` may contain interior mutability and a reference may not be safely transferrable across a catch_unwind boundary
   --> libsignal-protocol/src/context.rs:410:17
    |
410 |         let _ = std::panic::catch_unwind(|| {
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^ `std::cell::UnsafeCell<usize>` may contain interior mutability and a reference may not be safely transferrable across a catch_unwind boundary
    |
    = help: within `&context::State`, the trait `std::panic::RefUnwindSafe` is not implemented for `std::cell::UnsafeCell<usize>`
    = note: required because it appears within the type `std::cell::Cell<usize>`
    = note: required because it appears within the type `lock_api::remutex::RawReentrantMutex<parking_lot::raw_mutex::RawMutex, parking_lot::remutex::RawThreadId>`
    = note: required because it appears within the type `lock_api::remutex::ReentrantMutex<parking_lot::raw_mutex::RawMutex, parking_lot::remutex::RawThreadId, ()>`
    = note: required because it appears within the type `context::State`
    = note: required because it appears within the type `&context::State`
    = note: required because of the requirements on the impl of `std::panic::UnwindSafe` for `&&context::State`
    = note: required because it appears within the type `[closure@libsignal-protocol/src/context.rs:410:42: 413:10 state:&&context::State, level:&log::Level, message:&&str]`
```

As the message says, this is caused by RawReentrantMutex containing a std::cell::Cell, which is not RefUnwindSafe.

Any ideas on how to solve this?

Michael Bryan · Answer 1 · Fri Dec 20 2019 17:29:37 GMT+0800 (China Standard Time)

> I have been working on adding support for decrypting messages and if it is ok I would like to submit a PR with this functionality once I am done.

That's awesome! I'd be happy to add more functionality to this crate.

> Any ideas on how to solve this?

Looking at the docs for std::panic::UnwindSafe, my interpretation is that the RefUnwindSafe bound exists as a speed bump so we don't accidentally break invariants by accessing data (in this case I'm guessing it's some sort of recursion depth counter) when code panics.

Since the error message involves recursive locks, I'm assuming we could accidentally deadlock if we naively used std::panic::AssertUnwindSafe to make the compiler treat our mutex as RefUnwindSafe.

You may want to open an issue against the parking_lot crate directly and ask them how to handle unwinding while a re-entrant lock is held. Normally you'd use some sort of poisoning mechanism, but I don't know enough about the subtleties of locking to propose a solution.

The standard library's Mutex implements poisoning, so it manually implements UnwindSafe and RefUnwindSafe. Maybe parking_lot needs to do that too?
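For comparison, here is a minimal sketch of poisoning with std's Mutex. If a panic happens while a guard is held, the mutex is marked as poisoned. Later lock() calls then return Err(PoisonError) instead of silently handing out possibly half-updated data. The helper name poison_and_recover is hypothetical, and the sketch says nothing about how parking_lot's reentrant mutex behaves.

```rust
use std::panic;
use std::sync::Mutex;

/// Panics while holding `lock`, which leaves it poisoned. Then shows that a
/// later `lock()` reports the poisoning. Returns (was_poisoned, value).
pub fn poison_and_recover(lock: &Mutex<usize>) -> (bool, usize) {
    let _ = panic::catch_unwind(|| {
        let mut guard = lock.lock().unwrap();
        *guard += 1; // start an update...
        panic!("invariant broken mid-update"); // ...and unwind with the lock held
    });
    let was_poisoned = lock.is_poisoned();
    // `lock()` now returns Err(PoisonError). `into_inner` still gives access
    // to the data for callers that can repair or tolerate it.
    let value = match lock.lock() {
        Ok(guard) => *guard,
        Err(poisoned) => *poisoned.into_inner(),
    };
    (was_poisoned, value)
}
```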

Tobias Furuholm · Answer 2 · Sat Dec 21 2019 20:17:50 GMT+0800 (China Standard Time)

Thanks for the feedback!

The solution was quite simple: I just moved the acquisition of log_func out of the catch_unwind scope. See line 411 here.
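Here is a minimal reproduction of why moving the access out of the closure helps. It uses a hypothetical State whose Cell<usize> field plays the role of the counter inside RawReentrantMutex. Everything the closure needs is read before catch_unwind, so the closure captures plain values instead of &State. This is a sketch, not the crate's actual code.

```rust
use std::cell::Cell;
use std::panic;

// Hypothetical stand-in for context::State. The Cell makes it
// !RefUnwindSafe, like the counter inside RawReentrantMutex.
pub struct State {
    pub depth: Cell<usize>,
}

/// Reads what the closure needs *before* entering catch_unwind, so the
/// closure never borrows anything inside `State`.
pub fn guarded_depth(state: &State) -> Option<String> {
    // This version would be rejected with E0277, because the closure would
    // capture a reference to the Cell, which is not RefUnwindSafe:
    //     panic::catch_unwind(|| state.depth.get())
    let depth = state.depth.get();
    // The closure now captures only a usize, which is UnwindSafe.
    panic::catch_unwind(move || format!("depth = {depth}")).ok()
}
```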

Once this was fixed, I could replace the current mutex implementation with a reentrant mutex. This type only provides an RAII-based API, so I put the lock guards in a Vec to keep them alive until unlock is called. Note that unlock is not protected, but my thinking was that unlock should never be invoked unless the lock has been acquired by lock, so I think that should be ok.
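One alternative to keeping RAII guards alive in a Vec is to build a reentrant lock with explicit lock()/unlock() calls from std's Mutex and Condvar. It records the owning thread and a recursion depth. This is a swapped-in technique, not what the branch does, and the ReentrantLock type and its methods are hypothetical. It also differs from the unprotected unlock described above: this unlock checks ownership and returns false instead of panicking, because a panic must not unwind into C.

```rust
use std::sync::{Condvar, Mutex};
use std::thread::{self, ThreadId};

/// Bookkeeping protected by the inner std Mutex.
struct Owner {
    thread: Option<ThreadId>, // which thread holds the lock, if any
    depth: usize,             // how many times that thread has locked it
}

/// Sketch of a reentrant lock with explicit lock()/unlock() calls (the shape
/// C locking callbacks need) instead of RAII guards.
pub struct ReentrantLock {
    owner: Mutex<Owner>,
    released: Condvar,
}

impl ReentrantLock {
    pub fn new() -> Self {
        ReentrantLock {
            owner: Mutex::new(Owner { thread: None, depth: 0 }),
            released: Condvar::new(),
        }
    }

    /// Blocks until the lock is free or already held by the calling thread.
    pub fn lock(&self) {
        let me = thread::current().id();
        let guard = self.owner.lock().unwrap();
        // Wait while some *other* thread owns the lock.
        let mut owner = self
            .released
            .wait_while(guard, |o| matches!(o.thread, Some(t) if t != me))
            .unwrap();
        owner.thread = Some(me);
        owner.depth += 1;
    }

    /// Releases one level of locking. Returns false if the caller does not
    /// hold the lock.
    pub fn unlock(&self) -> bool {
        let me = thread::current().id();
        let mut owner = self.owner.lock().unwrap();
        if owner.thread != Some(me) {
            return false;
        }
        owner.depth -= 1;
        if owner.depth == 0 {
            owner.thread = None;
            self.released.notify_one(); // wake one waiting thread, if any
        }
        true
    }
}
```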

That was the easy part. Testing this turned out to be much more challenging. I tried two approaches:

1. Test Context directly: This turned out to be challenging, as Rust does its best to stop us from sharing references between threads. The solution I came up with has a race condition caused by sharing Context rather than the lock itself.
   Note that I added Sync and Send implementations to Context here to allow passing it between threads. This is something that needs to be studied in more detail.
2. Test passing encrypted messages between two threads: This works, but it is hard to say whether it proves anything about the mutex implementation.
   Note: to avoid adding Sync and Send to PreKeyBundle, I implemented a wrapper (PreKeyBundleWrapper). This should probably be cleaned up in some way before (potentially) adding this test to master.
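Scoped threads are one way to test a lock from several threads without adding Send/Sync impls or wrapping the lock in another lock. std::thread::scope was stabilized in Rust 1.63, after this discussion, and lets threads borrow the lock under test directly. The sketch below uses std's Mutex as a stand-in for the lock under test, and hammer is a hypothetical helper. A passing run only shows that no lost update was observed. It does not prove the lock is correct.

```rust
use std::sync::Mutex;
use std::thread;

/// Hammers `lock` from `threads` scoped threads. Each thread performs `iters`
/// deliberately split read-then-write increments while holding the lock. If
/// the lock ever let two threads in at once, updates would be lost and the
/// returned total would fall short of threads * iters.
pub fn hammer(lock: &Mutex<u64>, threads: usize, iters: u64) -> u64 {
    thread::scope(|s| {
        for _ in 0..threads {
            // `lock` is a plain borrow: no Arc and no wrapper lock needed.
            s.spawn(move || {
                for _ in 0..iters {
                    let mut guard = lock.lock().unwrap();
                    let seen = *guard;
                    thread::yield_now(); // widen the window a broken lock would expose
                    *guard = seen + 1;
                }
            });
        }
    });
    // All scoped threads have been joined by the time `scope` returns.
    *lock.lock().unwrap()
}
```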

I created a branch for each of the above since the first is broken and the second is kind of flawed.

The reentrant mutex implementation, however, is pushed to https://github.com/furuholm/libsignal-protocol-rs/tree/decryption.

Do you have any suggestion on how to proceed? Do you want me to create a PR without the tests or should we fix them first?

Michael Bryan · Answer 3 · Sat Dec 21 2019 21:38:53 GMT+0800 (China Standard Time)

> The solution was quite simple: I just moved the acquisition of log_func out of the catch_unwind scope. See line 411 here.

You've got to be careful here because if the lock is poisoned then calling unwrap() will trigger a panic.

Moving log_func outside of catch_unwind means a poisoned lock will now unwind into the calling function (written in C). Unwinding across the FFI boundary is UB, so any Rust code which may panic needs to be executed inside a catch_unwind.

You could remove the unwrap() and pattern match on the result of lock() though.

```rust
if let Ok(message) = std::str::from_utf8(buffer) {
    // we can't log the errors that occur while logging errors, so just
    // drop them on the floor...
    if let Ok(log_func) = state.log_func.lock() {
        // the guard is already held here; locking again (and unwrapping)
        // inside the closure would deadlock or panic
        let _ = std::panic::catch_unwind(|| {
            log_func(level, message);
        });
    }
}
```
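Below is a self-contained sketch of this pattern. It assumes a simplified State whose logger is a plain fn(&str) pointer behind a std Mutex; the crate's real State and log signature differ, and log_from_c is a hypothetical name. Nothing that can panic runs outside catch_unwind, and a poisoned lock is skipped rather than unwrapped. Copying the fn pointer out of the guard is a further simplification. It keeps the closure trivially UnwindSafe and releases the lock before the logger runs.

```rust
use std::panic;
use std::sync::Mutex;

// Hypothetical, simplified stand-in for the crate's State: the logger is
// a plain function pointer rather than a boxed closure.
pub struct State {
    pub log_func: Mutex<fn(&str)>,
}

/// Hypothetical logging callback body, as called from C. Returns true only
/// if the logger ran to completion. Neither from_utf8 nor lock() panics
/// (both report failure through a Result), and the logger itself runs inside
/// catch_unwind, so no panic can unwind into the C caller.
pub fn log_from_c(state: &State, buffer: &[u8]) -> bool {
    // We can't log errors that occur while logging, so drop them on the floor.
    let message = match std::str::from_utf8(buffer) {
        Ok(message) => message,
        Err(_) => return false,
    };
    // A poisoned lock shows up as Err here; skip logging instead of unwrapping.
    let log_func = match state.log_func.lock() {
        Ok(guard) => *guard, // copy the fn pointer; the guard is dropped after this arm
        Err(_) => return false,
    };
    // The closure captures only a fn pointer and a &str, both UnwindSafe.
    panic::catch_unwind(|| log_func(message)).is_ok()
}
```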

> That was the easy part. Testing this turned out to be much more challenging.

How do people normally test that locking is done correctly? I know one method is to verify it using logical analysis or special tools like Go's race detector, but that's not my area of expertise.

Another question to ask is whether we even need to test it. We could argue that using locking correctly is libsignal-protocol-c's responsibility.

Adding Send and Sync implementations isn't something I'd do lightly, so we may want to get another opinion here. Types containing raw pointers don't get Send and Sync automatically, so you need to implement them manually. That acts as a speed bump for people writing FFI code, letting them know they need to check that the C code being interacted with is also thread-safe.

Another thing to keep in mind is that if our Context is !Send and !Sync, we can statically ensure it is never sent or shared across threads. So all the lock() and unlock() functions could safely be rewritten as no-ops.
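Here is a minimal sketch of the no-op idea. A hypothetical Context that holds a raw pointer is automatically !Send and !Sync. The lock callbacks can then simply return. The `void (*)(void *user_data)` shape of the callbacks is an assumption about what the C library expects. The soundness argument also assumes every call into the C library goes through this single-threaded Context.

```rust
use std::ffi::c_void;

/// Hypothetical stand-in for the crate's Context. Holding a raw pointer makes
/// it !Send and !Sync automatically, so safe code can neither move it to
/// another thread nor share a reference to it across threads.
pub struct Context {
    _raw: *mut c_void,
}

/// No-op locking callbacks. Doing nothing is only sound while every call into
/// the C library made through this Context happens on one thread. As a side
/// effect, nested "locking" on that thread can never deadlock.
pub extern "C" fn lock_function(_user_data: *mut c_void) {}

/// Counterpart to `lock_function`; also a no-op.
pub extern "C" fn unlock_function(_user_data: *mut c_void) {}
```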

> However, it turns out that the decryption code in the underlying C library requires the mutex implementation to be reentrant, which parking_lot::RawMutex is not.

How did you initially determine we need a reentrant lock? If we can remove the requirement for reentrant locking these problems should all go away.

Tobias Furuholm · Answer 4 · Sun Dec 22 2019 17:32:28 GMT+0800 (China Standard Time)

I will play the Rust noob card here, but not spotting that I moved an unwrap out of a catch_unwind is kind of bad 😬. Your suggestion is obviously the correct one. Good catch!

Using the lockless approach sounds like an excellent idea! One thread, one Context!

Went ahead and implemented this. I added a comment about the reasoning behind the noop locking strategy inside lock_function. Should this be documented somewhere else as well?

If we go this path, I think the other discussion is kind of obsolete, but I still want to address a couple of your questions.

The remutex is (was) needed because decrypt_pre_key_message invokes session_cipher_decrypt_pre_key_signal_message, which in turn invokes session_cipher_decrypt_from_record_and_signal_message. The latter two both invoke signal_lock (which uses Context::lock_function).
Regarding testing of mutexes, I tried to take inspiration from the tests in parking_lot. The issue I had was having to wrap the lock under test in another lock to be able to transfer it between threads. This did not end well :)
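The reentrancy requirement in the call chain above can be modelled in a few lines. An outer call takes the lock and, while still holding it, an inner call tries to take the same lock again. The sketch uses std's non-reentrant Mutex as a stand-in for the original RawMutex-based callbacks, and nested_acquire is a hypothetical name. try_lock makes the nested failure observable; a blocking lock() at that point would not return.

```rust
use std::sync::{Mutex, TryLockError};

/// Hypothetical model of the decrypt call chain. The outer call holds the
/// lock while an inner call (like the nested signal_lock) takes it again.
/// Returns true if the inner acquisition succeeds.
pub fn nested_acquire(lock: &Mutex<()>) -> bool {
    let _outer = lock.lock().unwrap();
    // A blocking lock() here would deadlock or panic; try_lock reports the
    // conflict instead.
    match lock.try_lock() {
        Ok(_inner) => true,
        Err(TryLockError::WouldBlock) => false,
        Err(TryLockError::Poisoned(_)) => false,
    }
}
```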

Note: The noop locking is now part of the decryption branch. Depending on your preferences this could potentially be a separate PR though.

Michael Bryan · Answer 5 · Mon Dec 23 2019 07:40:21 GMT+0800 (China Standard Time)

All good, it's always nice to have a second pair of eyes look over your code 😁

> I added a comment about the reasoning behind the noop locking strategy inside lock_function. Should this be documented somewhere else as well?

How about using static_assertions::assert_not_impl_all!(Context: Send, Sync) to assert that our Context type isn't thread-safe? Then just above it you'll be able to mention why we use no-op locking and link to this conversation. That way we can ensure future changes won't accidentally make the Context type Send.
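One detail about the static_assertions macros: assert_not_impl_all!(Context: Send, Sync) only fails if Context implements *both* traits. assert_not_impl_any!(Context: Send, Sync) fails if it implements *either*, which seems closer to the goal of never letting Context become Send. Below is a std-only sketch of the same compile-time check, for illustration. It relies on a well-known trick: an inherent associated constant takes priority over a trait constant of the same name whenever its bounds hold. Context, Probe, NotSend and NotSync are hypothetical names.

```rust
use std::ffi::c_void;
use std::marker::PhantomData;

// Hypothetical stand-in for the crate's Context. The raw pointer is what
// makes it !Send and !Sync.
pub struct Context {
    _raw: *mut c_void,
}

// Fallback: every type reports `false` through these blanket trait impls...
trait NotSend { const IS_SEND: bool = false; }
impl<T: ?Sized> NotSend for T {}
trait NotSync { const IS_SYNC: bool = false; }
impl<T: ?Sized> NotSync for T {}

// ...but these inherent constants win whenever their bounds hold.
struct Probe<T: ?Sized>(PhantomData<T>);
impl<T: ?Sized + Send> Probe<T> { const IS_SEND: bool = true; }
impl<T: ?Sized + Sync> Probe<T> { const IS_SYNC: bool = true; }

// Context must stay single-threaded because the C library's lock callbacks
// are no-ops. Compilation fails if Context ever becomes Send or Sync.
const _: () = assert!(!<Probe<Context>>::IS_SEND && !<Probe<Context>>::IS_SYNC);
```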

> The noop locking is now part of the decryption branch. Depending on your preferences this could potentially be a separate PR though.

Whether you'd like to land it all in one big PR or break each set of changes up into its own smaller PR, I'm happy either way. I guess smaller PRs might make code review easier, though.

Michael Bryan · Answer 6 · Sun May 31 2020 15:56:46 GMT+0800 (China Standard Time)

Implemented in #57.