DataDog / glommio

Glommio is a thread-per-core crate that makes writing highly parallel asynchronous applications in a thread-per-core architecture easier for rustaceans.
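For context, the basic model is an executor that owns a single OS thread and runs futures to completion on it. A minimal sketch (API details may vary slightly between glommio versions):

    use glommio::LocalExecutor;

    fn main() {
        // One executor bound to the current thread; tasks spawned on it never
        // migrate to another core.
        let ex = LocalExecutor::default();
        ex.run(async {
            println!("hello from a thread-per-core executor");
        });
    }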

MeshIPC causes valgrind to hang

vlovich opened this issue

I haven't fully traced down what's happening, but valgrind seems to randomly hang, and only when using a mesh.

Normal without valgrind:

cargo run
   Compiling valgrind-deadlock v0.1.0 (/home/vlovich/valgrind-deadlock)
    Finished dev [unoptimized + debuginfo] target(s) in 1.77s
     Running `target/debug/valgrind-deadlock`
Started thread 1
Started thread 2
Initializing ping from peer 0
1: Responding pong to 0
1: Sending ping to 0
Thread 1 task results: [()]
Terminating thread 1
1: Stopping - Channel is closed { .. }
Thread 2 task results: [()]
Terminating thread 2

With valgrind:

valgrind target/debug/valgrind-deadlock
==223234== Memcheck, a memory error detector
==223234== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==223234== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==223234== Command: target/debug/valgrind-deadlock
==223234== 
Started thread 2
^C^C^C==223234== 
==223234== Process terminating with default action of signal 2 (SIGINT)
==223234==    at 0x259784: core::cell::RefCell<T>::borrow (cell.rs:863)
==223234==    by 0x215C58: glommio::sys::source::Source::raw (source.rs:303)
==223234==    by 0x19488C: glommio::sys::uring::Reactor::wait (uring.rs:1839)
==223234==    by 0x19EF10: glommio::reactor::Reactor::react (reactor.rs:817)
==223234==    by 0x148506: glommio::parking::Inner::park (parking.rs:87)
==223234==    by 0x1485B8: glommio::parking::Parker::poll_io (parking.rs:64)
==223234==    by 0x188B49: glommio::executor::LocalExecutor::run::{{closure}} (mod.rs:1461)
==223234==    by 0x14996F: scoped_tls::ScopedKey<T>::set (lib.rs:137)
==223234==    by 0x1887BE: glommio::executor::LocalExecutor::run (mod.rs:1439)
==223234==    by 0x18B004: glommio::executor::LocalExecutorPoolBuilder::spawn_thread::{{closure}} (mod.rs:987)
==223234==    by 0x1A262D: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:122)
==223234==    by 0x19C66E: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}} (mod.rs:514)
==223234== 
==223234== HEAP SUMMARY:
==223234==     in use at exit: 21,318,672 bytes in 143 blocks
==223234==   total heap usage: 31,045 allocs, 30,902 frees, 35,782,609 bytes allocated
==223234== 
==223234== LEAK SUMMARY:
==223234==    definitely lost: 0 bytes in 0 blocks
==223234==    indirectly lost: 0 bytes in 0 blocks
==223234==      possibly lost: 2,024 bytes in 8 blocks
==223234==    still reachable: 21,316,648 bytes in 135 blocks
==223234==         suppressed: 0 bytes in 0 blocks
==223234== Rerun with --leak-check=full to see details of leaked memory
==223234== 
==223234== For lists of detected and suppressed errors, rerun with: -s
==223234== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

The specific stack trace tends to vary:

^C^C^C==223700== 
==223700== Process terminating with default action of signal 2 (SIGINT)
==223700==    at 0x1D22AC: <crossbeam_channel::select::Token as core::default::Default>::default (select.rs:30)
==223700==    by 0x1EF547: crossbeam_channel::flavors::list::Channel<T>::try_recv (list.rs:433)
==223700==    by 0x209488: crossbeam_channel::channel::Receiver<T>::try_recv (channel.rs:751)
==223700==    by 0x250490: glommio::sys::SleepNotifier::process_foreign_wakes (mod.rs:342)
==223700==    by 0x1E89E4: glommio::sys::uring::Reactor::process_foreign_wakes (uring.rs:1406)
==223700==    by 0x1F7E6B: glommio::reactor::Reactor::process_shared_channels (reactor.rs:772)
==223700==    by 0x1F7FAA: glommio::reactor::Reactor::spin_poll_io (reactor.rs:800)
==223700==    by 0x188EF2: glommio::executor::LocalExecutor::run::{{closure}} (mod.rs:1480)
==223700==    by 0x14996F: scoped_tls::ScopedKey<T>::set (lib.rs:137)
==223700==    by 0x1887BE: glommio::executor::LocalExecutor::run (mod.rs:1439)
==223700==    by 0x18B004: glommio::executor::LocalExecutorPoolBuilder::spawn_thread::{{closure}} (mod.rs:987)
==223700==    by 0x1A262D: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:122)

main.rs: https://gist.github.com/vlovich/fddbd15c52a3b86648688e2fc3d66e30
If you change use_mesh to false, the code works fine under valgrind, which makes me think it might be some kind of issue with the mesh. The interesting thing is that thread 1 never starts, but I don't know enough to think of why that would be impacted by the existence of a mesh... Similarly, even if I increase the number of threads, still only one prints as started. The channel size doesn't have an impact either.

In my specific code that I'm seeing this in, ctrl-c doesn't even work. I have to pkill -9 -f valgrind.

I wonder if there's some subtle race condition in the events where if the timing happens just right, even non-valgrind would hang.

Is it just me, or does it look like it's being triggered by crossbeam? Which would be... interesting...

Added some instrumentation into glommio join. Successful run without valgrind:

Started thread 1
glommio: Connecting receiver
Started thread 2
glommio: Connecting peer
glommio: Peer connected
glommio: Receiver connected
glommio: Peer id = 1, role id = 1
glommio: Peer id = 0, role id = 0
glommio: 0 waiting for sender to 1 to connect
glommio: 1 waiting for sender to 0 to connect
glommio: 0 connected sender to 1
glommio: 0 waiting for receiver to 1 to connect
glommio: 0 connected receiver to 1
glommio: 0 Joined
Initializing ping from peer 0
glommio: 1 connected sender to 0
glommio: 1 waiting for receiver to 0 to connect
glommio: 1 connected receiver to 0
glommio: 1 Joined
1: Responding pong to 0
1: Sending ping to 0
Thread 1 task results: [()]
Terminating thread 1
1: Stopping - Channel is closed { .. }
Thread 2 task results: [()]
Terminating thread 2

Under valgrind:

Started thread 1
Connecting receiver
Started thread 2
Connecting peer
Peer connected
Peer id = 1, role id = 1
1 waiting for sender to 0 to connect
Receiver connected
Peer id = 0, role id = 0
0 waiting for sender to 1 to connect
0 connected sender to 1
0 waiting for receiver to 1 to connect
0 connected receiver to 1
0 Joined
Initializing ping from peer 0
^CKilled

So it seems like the second thread is failing to initialize properly for some reason. @bryandmc I don't see crossbeam anywhere in the stack trace (via target remote | vgdb). I think the second thread that never starts is blocked on io_uring_wait_cqes. There are two additional threads doing a blocking wait via futex_wait for glommio::sys::blocking::BlockingThreadReq (I'm guessing this is just waiting on the glommio thread to terminate?). It's not clear to me why the second thread never gets the I/O completion...

Interesting. On a recent run I added even more instrumentation in shared_channel.rs (line 290):

    pub async fn send(&self, item: T) -> Result<(), T> {
        println!("{} Waiting for room to send...", self.id);
        let waiter = future::poll_fn(|cx| self.wait_for_room(cx));
        println!("{} awaiting checking for room", self.id);

The first log line is printed and then it hangs. The stack trace shows:

Thread 2 (Thread 21742 (tid 2 VgTs_WaitSys unnamed-1)):
#0  __GI___libc_write (nbytes=30, buf=0x69a80b0, fd=1) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __GI___libc_write (fd=1, buf=0x69a80b0, nbytes=30) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x00000000003fafe8 in std::sys::unix::fd::FileDesc::write () at library/std/src/sys/unix/fd.rs:152
#3  std::sys::unix::stdio::{impl#3}::write () at library/std/src/sys/unix/stdio.rs:39
#4  std::io::stdio::{impl#1}::write () at library/std/src/io/stdio.rs:120
#5  std::io::buffered::bufwriter::BufWriter::flush_buf<std::io::stdio::StdoutRaw> () at library/std/src/io/buffered/bufwriter.rs:166
#6  0x00000000003fbf20 in std::io::buffered::linewritershim::{impl#1}::write_all<std::io::stdio::StdoutRaw> () at library/std/src/io/buffered/linewritershim.rs:269
#7  std::io::buffered::linewriter::{impl#1}::write_all<std::io::stdio::StdoutRaw> () at library/std/src/io/buffered/linewriter.rs:206
#8  std::io::stdio::{impl#14}::write_all () at library/std/src/io/stdio.rs:735
#9  0x00000000003fd835 in std::io::Write::write_fmt::{impl#0}::write_str<std::io::stdio::StdoutLock> () at library/std/src/io/mod.rs:1668
#10 0x000000000041b771 in core::fmt::write () at library/core/src/fmt/mod.rs:1226
#11 0x00000000003fbc24 in std::io::Write::write_fmt<std::io::stdio::StdoutLock> () at library/std/src/io/mod.rs:1679
#12 std::io::stdio::{impl#13}::write_fmt () at library/std/src/io/stdio.rs:715
#13 0x00000000003fc603 in std::io::stdio::{impl#12}::write_fmt () at library/std/src/io/stdio.rs:689
#14 std::io::stdio::print_to<std::io::stdio::Stdout> () at library/std/src/io/stdio.rs:1017
#15 std::io::stdio::_print () at library/std/src/io/stdio.rs:1030
#16 0x0000000000182c21 in glommio::channels::shared_channel::{impl#9}::send::{async_fn#0}<bool> () at /home/vlovich/glommio/glommio/src/channels/shared_channel.rs:291

If I attach gdb and hit continue I get much farther. Not sure why stdio print is hanging...

Finally thread 2 hangs at:

#0  core::sync::atomic::atomic_load<u32> () at library/core/src/sync/atomic.rs:2961
#1  core::sync::atomic::AtomicU32::load () at library/core/src/sync/atomic.rs:2075
#2  std::sys::unix::locks::futex_mutex::Mutex::spin () at library/std/src/sys/unix/locks/futex_mutex.rs:70
#3  std::sys::unix::locks::futex_mutex::Mutex::lock_contended () at library/std/src/sys/unix/locks/futex_mutex.rs:61
#4  0x00000000003fbcdc in std::sys::unix::locks::futex_mutex::Mutex::lock () at library/std/src/sys/unix/locks/futex_mutex.rs:30
#5  std::sys_common::mutex::MovableMutex::raw_lock () at library/std/src/sys_common/mutex.rs:75
#6  std::sys_common::remutex::ReentrantMutex::lock<core::cell::RefCell<std::io::buffered::linewriter::LineWriter<std::io::stdio::StdoutRaw>>> () at library/std/src/sys_common/remutex.rs:103
#7  std::io::stdio::Stdout::lock () at library/std/src/io/stdio.rs:656
#8  std::io::stdio::{impl#13}::write_fmt () at library/std/src/io/stdio.rs:715

and the hack of continuing via gdb doesn't work here. Not sure what's going on.

When I run valgrind with drd, it starts with a complaint about latch and then a bunch of other data race warnings follow: https://gist.github.com/vlovich/b553ce3a2450907b62b135e3bdf8038c. Helgrind has similar complaints: https://gist.github.com/vlovich/4162aefdaaa8dc4a477431d1f44fab5c

I suspect these are false positives though. @glommer any thoughts here?

I took a quick look at this, and I don't have a great explanation yet.

It seems so far that the hang is not on the mesh, but on the shared channel (which is used internally to build the mesh).
It hangs on its connect function.

connect() will essentially make sure that the other side is connected. So what seems to be happening is that because the other thread never starts, this never connects, so it hangs.

I don't know exactly how valgrind does threading, but based on what it does, it makes sense that it will multiplex everything in a single thread of execution internally. A quick search reveals that this may be the case: https://stackoverflow.com/questions/8663148/valgrind-stalls-in-multithreaded-socket-program

I'll take a look at those articles and see if they yield something valuable.

But tl;dr: the hang seems related to the second thread never starting, which in turn happens because thread 1 blocks waiting for it, and it wouldn't really happen outside valgrind.

The shared channel (channels/shared_channel.rs) has a type, Connector, that implements Future.

When the other side connects, peer_id() should return the id of the other side. It is being polled in a loop and always returns 0. At this point my process is using 100% CPU. Even if all the threads end up running on a single core under valgrind, it should emulate a yield. So the biggest mystery here is why the other thread won't start.
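For illustration, the busy-polling behavior described above looks roughly like this (a hand-rolled sketch, not glommio's actual Connector; names and fields are made up):

    use std::future::Future;
    use std::pin::Pin;
    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;
    use std::task::{Context, Poll};

    // Resolves once the peer has published a non-zero id.
    struct Connector {
        peer_id: Arc<AtomicUsize>, // written by the other side when it connects
    }

    impl Future for Connector {
        type Output = usize;

        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<usize> {
            let id = self.peer_id.load(Ordering::Acquire);
            if id != 0 {
                Poll::Ready(id)
            } else {
                // Re-schedule ourselves immediately. If the other thread never
                // runs (as appears to happen under valgrind), this polls
                // forever and pegs the CPU at 100%.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }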

Btw, I tried --fair-sched=yes and it's still hanging; that flag is the only relevant thing I've found about valgrind. Maybe I should open a bug upstream...

I wonder if there's a more direct repro case we can construct...

I made some progress on this. I narrowed it down to the creation of the companion blocking thread.

In sys/uring.rs you will find the constructor of the main reactor, and you will see a line saying:

            blocking_thread: BlockingThreadPool::new(thread_pool_placement, notifier.clone())?,

This blocks and never returns.

This thread is used to execute expensive operations out of the main thread. I still don't know why, but you can easily verify with printlns if this is the case for you as well - if yes, we made some progress!
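For reference, the general shape of such a companion pool is roughly the following (a simplified sketch, not glommio's actual BlockingThreadPool): a plain OS thread that receives closures over a channel and runs them off the reactor thread. glommio's real pool does more than this (placement, notifier wiring), which is where the hang was initially suspected.

    use std::sync::mpsc;
    use std::thread;

    struct BlockingPool {
        tx: mpsc::Sender<Box<dyn FnOnce() + Send>>,
        handle: thread::JoinHandle<()>,
    }

    impl BlockingPool {
        fn new() -> Self {
            let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
            let handle = thread::spawn(move || {
                // Run expensive/blocking work here so the reactor thread never
                // stalls on it.
                for job in rx {
                    job();
                }
            });
            BlockingPool { tx, handle }
        }

        fn push(&self, job: impl FnOnce() + Send + 'static) {
            let _ = self.tx.send(Box::new(job));
        }
    }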

Hmm... not sure I'm seeing the same thing. What I did was move that blocking_thread construction to a standalone line and then add surrounding printlns:

+        println!("starting blocking thread");
+        let blocking_thread = BlockingThreadPool::new(thread_pool_placement, notifier.clone())?;
+        println!("started blocking thread");
+
         Ok(Reactor {
             main_ring: RefCell::new(main_ring),
             latency_ring: RefCell::new(latency_ring),
             poll_ring: RefCell::new(poll_ring),
             latency_preemption_timeout_src: Cell::new(None),
             throughput_preemption_timeout_src: Cell::new(None),
-            blocking_thread: BlockingThreadPool::new(thread_pool_placement, notifier.clone())?,
+            blocking_thread,

Running this without valgrind:

starting blocking thread
starting blocking thread
started blocking thread
started blocking thread
Started thread 2
Started thread 1
Initializing ping from peer 0
1: Responding pong to 0
1: Sending ping to 0
Thread 1 task results: [()]
Terminating thread 1
1: Stopping - Channel is closed { .. }
Thread 2 task results: [()]
Terminating thread 2

Running this with valgrind:

valgrind --fair-sched=yes target/debug/valgrind-deadlock
==1376864== Memcheck, a memory error detector
==1376864== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==1376864== Using Valgrind-3.21.0.GIT and LibVEX; rerun with -h for copyright info
==1376864== Command: target/debug/valgrind-deadlock
==1376864== 
starting blocking thread
starting blocking thread
started blocking thread
Started thread 2
started blocking thread
Started thread 1

without fair-sched:

valgrind target/debug/valgrind-deadlock
==1377287== Memcheck, a memory error detector
==1377287== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==1377287== Using Valgrind-3.21.0.GIT and LibVEX; rerun with -h for copyright info
==1377287== Command: target/debug/valgrind-deadlock
==1377287== 
starting blocking thread
starting blocking thread
started blocking thread
Started thread 1
started blocking thread
Started thread 2
Initializing ping from peer 0

What I do wonder about, though, is that in the non-valgrind case the "Started thread N" message prints after "started blocking thread", whereas in the valgrind case there's typically some interleaving. Of course, I haven't investigated whether that's at all important. Just something I thought I'd note.

Now what's interesting is that if I leave a println only after the BlockingThreadPool::new call (but not before), then valgrind looks like this:

valgrind target/debug/valgrind-deadlock
==1378072== Memcheck, a memory error detector
==1378072== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==1378072== Using Valgrind-3.21.0.GIT and LibVEX; rerun with -h for copyright info
==1378072== Command: target/debug/valgrind-deadlock
==1378072== 
started blocking thread
Started thread 1

but fair-sched looks different:

valgrind --fair-sched=yes target/debug/valgrind-deadlock
==1378295== Memcheck, a memory error detector
==1378295== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==1378295== Using Valgrind-3.21.0.GIT and LibVEX; rerun with -h for copyright info
==1378295== Command: target/debug/valgrind-deadlock
==1378295== 
started blocking thread
started blocking thread
Started thread 1
Started thread 2
Initializing ping from peer 0

Notice the above has no interleaving but still deadlocks...

Interesting. I'll keep poking.
For me it was very consistent.

We're narrowing this down.

Actually, I don't think this necessarily has anything to do with MeshIPC / LocalExecutorPoolBuilder. Here's an even simpler scenario that also seems to hang: https://gist.github.com/vlovich/6876632b48df4289eb3e05716f9f431a - it only uses tokio sync and a single LocalExecutorBuilder in a background thread. If you remove the glommio instantiation, this code runs fine under valgrind. If you replace glommio with the tokio runtime, it also runs fine under valgrind:

            let rt = tokio::runtime::Builder::new_current_thread()
                .build()
                .unwrap();
            rt.block_on(async move {
                loop {
                    eprintln!("BACKGROUND: tokio Waiting for task to process");
                    // Exit after receiving (or failing to receive) the first task.
                    let _task = match receiver.recv().await {
                        Some(task) => task,
                        None => return,
                    };

                    return;
                }
            });

It's something to do with the Context / Waker, but I thought that was a Rust thing? After instrumenting tokio's oneshot internals, I see wake_by_ref being called on the send side, but the receive side never actually sees the event. So maybe the way glommio wakes Wakers across threads is busted under valgrind for some reason?
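For what it's worth, the cross-thread wake path the stack traces point at looks conceptually like this (a rough sketch, not glommio's actual code): the sending side queues the Waker on a crossbeam channel and nudges the sleeping executor, which later drains the queue ("foreign wakes") and calls wake(). If that drain never observes the queued waker, the receiving future is never polled again.

    use std::task::Waker;

    struct ForeignWakeQueue {
        tx: crossbeam_channel::Sender<Waker>,
        rx: crossbeam_channel::Receiver<Waker>,
    }

    impl ForeignWakeQueue {
        fn new() -> Self {
            let (tx, rx) = crossbeam_channel::unbounded();
            ForeignWakeQueue { tx, rx }
        }

        // Called from another thread when the remote side makes progress. The
        // real implementation also pokes the sleeping reactor (e.g. via an
        // eventfd) so it wakes up and drains this queue; omitted here.
        fn notify(&self, waker: Waker) {
            let _ = self.tx.send(waker);
        }

        // Called by the owning executor, analogous to process_foreign_wakes()
        // in the earlier stack trace.
        fn drain(&self) {
            while let Ok(waker) = self.rx.try_recv() {
                waker.wake();
            }
        }
    }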

Well, the latest repro disappears if I use --fair-sched=yes. A deadlock on the stderr futex within the Rust stdlib reappears (even with --fair-sched=yes) if I add a bunch of logging instrumentation within tokio. I'm starting to suspect there's something subtle glommio does during initialization that breaks valgrind's futex implementation or something. I had thought it might be related to CPU affinities, but I thought using Unbound would bypass that aspect (at least from skimming that code it would). @glommer do you have any tips about anything "non-trivial" glommio might be doing on initialization that might confuse valgrind's futex implementation?

the implementation of the waker is the domain of the executor, so there could be a leak there. That's the code under src/task, and there's a lot of unsafe there...

I also noticed problems with flume channels not being processed for the blocking operations (e.g. copy_file_range) under valgrind. If I add an eprintln or a std::thread::sleep(std::time::Duration::from_millis(1)) after tx.send_async in BlockingThreadPool::push, then the channel seems to work correctly. Very strange (this one happens regardless of --fair-sched=yes).
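To make the workaround concrete, this is roughly where the sleep went (a hypothetical stand-in; the real BlockingThreadPool::push has a different signature):

    // Hypothetical stand-in for BlockingThreadPool::push, just to show the
    // placement of the workaround.
    async fn push<T>(tx: &flume::Sender<T>, item: T) {
        // Ignore send errors for the purposes of this sketch.
        let _ = tx.send_async(item).await;
        // Workaround: without this brief sleep (or an eprintln), the flume
        // channel appears to never get processed under valgrind.
        std::thread::sleep(std::time::Duration::from_millis(1));
    }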