entered unreachable code with SignalsInfo::forever()

Question

entered unreachable code with SignalsInfo::forever()

anhammer opened this issue a year ago · comments

I got the following panic in one of my unit tests:

thread '<unnamed>' panicked at 'internal error: entered unreachable code: Because of the blocking has_signals method the poll_signal method never returns Poll::Pending but blocks until a signal arrived', /tmp/cargo_home/registry/src/index.crates.io-6f17d22bba15001f/signal-hook-0.3.15/src/iterator/mod.rs:308:36

Unfortunately I can't reproduce the issue anymore.

This was in a unit test, which is running in parallel with other unit tests that also run the signal handling code. My guess would be that it is a problem to have multiple instances that handle signals in the same process, but I didn't find anything in the docs about this. (Apologies if I missed it)

After digging a bit into the code, it seems that the (blocking) has_signals() reads from a pipe and returns the length 0, resulting in the result being interpreted as PollResult::Pending, which should never happen for the blocking call.

I don't remember exactly, but I think there were multiple reasons for receiving a 0:

EOF was reached
The buffer the read data is written to has size 0
The pipe was closed

My guess is that maybe due to the parallel access to a single data structure the pipe was already emptied. But could as well be some kind of race condition when resources are freed.

Michal 'vorner' Vaner · Answer 1 · Fri Jul 07 2023 19:49:10 GMT+0800 (China Standard Time)

My guess would be that it is a problem to have multiple instances that handle signals in the same process, but I didn't find anything in the docs about this. (Apologies if I missed it)

No, the idea is you should be able to have multiple instances safely. This looks like a legitimate bug. I'll try to look into it as soon as possible (though not sure when that will be exactly :-|).

As for reproducing, can you let the unit tests run in a loop until it fails? That is something like while cargo test ; do : ; done? If it happens again, it would be nice to get the full backtrace 😇.

Or, do you have the unit tests somewhere public?

anhammer · Answer 2 · Fri Jul 07 2023 21:24:34 GMT+0800 (China Standard Time)

Running it in loops now, but did not see the issue anymore :( but will write here if a test fails. I also tinkered around a bit with instantiating multiple handlers, adding some sleeps, etc, nothing helped to reproduce.

Unfortunately I can't make the code public, but I made a smaller example that shows more or less what is happening. (The interaction with signal_hook is exactly the same, just some closure and buildstructor stuff is missing)

click here for code:

use signal_hook::consts;
use signal_hook::iterator::Handle;
use signal_hook::iterator::Signals;

struct SignalHandler {
    thread: Option<std::thread::JoinHandle<()>>,
    stop_handle: Option<Handle>,
}

impl Drop for SignalHandler {
    fn drop(&mut self) {
        self.stop();
    }
}

impl SignalHandler {
    fn new(register_signals: &[i32]) -> Result<Self, String> {
        for signal in register_signals {
            Self::check_signal_is_in_range(signal)?;
        }

        let signals = Signals::new(register_signals).unwrap();
        let stop_handle = signals.handle();
        let thread = std::thread::spawn(move || {
            Self::thread_loop(signals);
        });

        Ok(Self {
            thread: Some(thread),
            stop_handle: Some(stop_handle),
        })
    }

    fn check_signal_is_in_range(signal: &i32) -> Result<(), String> {
        if !(1..=31).contains(signal) {
            return Err(String::from("invalid signal '{signal}'"));
        }
        Ok(())
    }

    fn thread_loop(mut signals: Signals) {
        for signal in signals.forever() {
            // minimal example; normally some registered closures would be executed here
            println!("got signal {signal}, wooo!");
        }
    }

    fn stop(&mut self) {
        if let Some(stop_handle) = self.stop_handle.take() {
            stop_handle.close();
        } else {
            println!("stop handle not found");
        }

        if let Some(join_handle) = self.thread.take() {
            join_handle.join().unwrap();
        } else {
            println!("join handle not found")
        }
    }
}

fn main() {
    let signals = [consts::SIGUSR1, consts::SIGUSR2];
    let handler = SignalHandler::new(&signals);
    drop(handler);
}

#[cfg(test)]
mod tests {
    use super::*;

    fn is_thread_running(signal_handler: &SignalHandler) -> bool {
        match &signal_handler.thread {
            Some(join_handle) => !join_handle.is_finished(),
            None => false,
        }
    }

    #[test]
    fn test_stop() {
        let mut signal_handler = SignalHandler::new(&[]).unwrap();
        assert!(is_thread_running(&signal_handler));
        signal_handler.stop();
        assert!(!is_thread_running(&signal_handler));
    }

    #[test]
    fn test_with_registered_signal() {
        // this is the test that failed
        let signals = [consts::SIGUSR1, consts::SIGUSR2];
        let handler = SignalHandler::new(&signals).unwrap();
        drop(handler);
    }

    #[test]
    fn test_reject_invalid_signals() {
        let invalid_signals = [[1, -1], [1, 32]];

        for signals in invalid_signals {
            assert!(SignalHandler::new(&signals).is_err());
        }
    }
}

Maybe worth mentioning is that the tests run inside a docker container, that has rust 1.70 installed. Furthermore, unittests are run twice usually, once on stable with llvm-cov, and once on nightly with ASAN enabled. (ASAN build succeeded).

Also the build failed (only once so far) on a build server. Like mentioned it runs in docker, but there could still have been some side effects (signals?) that were present during the build that triggered the issue. Pretty hard to say... That being said, I was not able to reproduce the issue anymore, neither on my dev machine nor on the build machine.

Command line for unit tests:
cargo llvm-cov --no-report --lib --bins

Command line for unit tests with ASAN enabled:

RUSTFLAGS="-Z sanitizer=address" cargo \
    +nightly-2023-06-08 \
    test \
    --target=x86_64-unknown-linux-gnu \
    --release \
    --lib \
    --bins

edit: signal hook entry in Cargo.toml:
signal-hook = {version = "=0.3.15"}

Hope this helps

Michal 'vorner' Vaner · Answer 3 · Sat Jul 15 2023 16:28:03 GMT+0800 (China Standard Time)

I think I've discovered what might have happened. As it isn't possible to easily reproduce, I'd ask for you to read through it if the explanation makes sense to you (in #149). I'd then merge and release.

Pantos · Answer 4 · Tue Jul 18 2023 01:36:00 GMT+0800 (China Standard Time)

Heyhey, anhammer here. Having some trouble with posting with my account, so quickly created a new one. (will stick to it for now...)

So the reason behind it makes definitely sense to me. I guess this is because the function that is doing the read can't distinguish between EOF and closed state? (and interpretes closed as EOF/pending)

Looks good to me :)

edit: I think the PR would fix the issue (if it is not the issue, it is a case that needs to be handled anyway). But I was wondering about the loop. It can't loop endlessly, but if a bug is introduced elsewhere in the future, this maybe could result in a hang, which is not so nice, compared to a panic. Would it maybe make sense to just call next() in the PollResult::Pending case, and assert that it returns PollResult::Closed?

edit2: Ah nevermind, I forgot the borrow checker. Gotta do more thinking before writing :P

Michal 'vorner' Vaner · Answer 5 · Tue Jul 18 2023 23:34:03 GMT+0800 (China Standard Time)

OK, released as 0.3.17.