NixOS / ofborg

@ofborg tooling automation https://monitoring.ofborg.org/dashboard/db/ofborg

Home Page: https://ofborg.org

aarch64 community box dropping connection to AMQP host

cole-h opened this issue · comments

Recently, the pool of active aarch64 builders has been shrinking over time, until the next redeploy brings them back to life. After a bit of debugging, we noticed that the connection to the AMQP host is somehow lost, but the builder doesn't exit. Heartbeats should have helped in this situation, but didn't. Looking at the connections with ss, we see the following on a busted builder:

Netid State      Recv-Q Send-Q                              Local Address:Port                   Peer Address:Port   Process
u_str ESTAB      0      0                                               * 44148                             * 24754   users:(("builder",pid=25341,fd=2),("builder",pid=25341,fd=1),("grahamcofborg-b",pid=25335,fd=2),("grahamcofborg-b",pid=25335,fd=1))

and the following on a working builder:

Netid State      Recv-Q Send-Q                              Local Address:Port                   Peer Address:Port   Process
u_str ESTAB      0      0                                               * 80906                             * 33026   users:(("builder",pid=24827,fd=2),("builder",pid=24827,fd=1),("grahamcofborg-b",pid=24816,fd=2),("grahamcofborg-b",pid=24816,fd=1))
tcp   ESTAB      0      0                                 xxx.xxx.xxx.xxx:47128               xxx.xxx.xxx.xxx:5671    users:(("builder",pid=24827,fd=3))

As you can see, the busted builder has dropped its connection to the AMQP host, while the working builder still has an established connection.
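Until the root cause is fixed, this state is detectable from outside the process. A minimal sketch (assuming the broker listens on 5671, as in the ss output above; this is not ofborg's actual tooling, just an illustration of the check):

```shell
#!/bin/sh
# Hypothetical watchdog check: does any local process still hold a TCP
# connection to the AMQP broker port (5671, taken from the ss output above)?
if ss -tn 2>/dev/null | grep -q ':5671'; then
    echo "AMQP connection present"
else
    echo "AMQP connection missing; builder may be wedged"
fi
```

On a wedged builder this prints the "missing" line, at which point a supervisor could restart the unit rather than waiting for the next redeploy.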


Potentially unrelated, but in the stack of one of the busted builders, we also noticed the following thread:

TID 25367:
#0  0x0000ffff8f0eacd4 pthread_cond_wait@@GLIBC_2.17
#1  0x0000aaaae9de77d0 std::thread::park::h1fac58ddd22dac93
#2  0x0000aaaae98ac830 crossbeam_channel::context::Context::wait_until::h47058df4a5256735
#3  0x0000aaaae9987c88 crossbeam_channel::flavors::list::Channel$LT$T$GT$::recv::_$u7b$$u7b$closure$u7d$$u7d$::h02279d7921848ff9
#4  0x0000aaaae98adf18 crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::h6995da6c0885a3c7
#5  0x0000aaaae98aef2c crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::he1284795f72001bb
#6  0x0000aaaae99a9e80 std::thread::local::LocalKey$LT$T$GT$::try_with::h4fdf647ecd5ad711
#7  0x0000aaaae98ad0b4 crossbeam_channel::context::Context::with::hb484fda6ebec39c0
#8  0x0000aaaae9987b88 crossbeam_channel::flavors::list::Channel$LT$T$GT$::recv::hf6bf4df54bc8bfd4
#9  0x0000aaaae998289c crossbeam_channel::channel::Receiver$LT$T$GT$::recv::hec7e7cbef7223cc7
#10 0x0000aaaae9a403f4 lapin::socket_state::SocketState::wait::h1b7cc14ee32195dd
#11 0x0000aaaae995b200 lapin::io_loop::IoLoop::run::h5c2f4a08fb0d30a3
#12 0x0000aaaae995a7c4 lapin::io_loop::IoLoop::start::_$u7b$$u7b$closure$u7d$$u7d$::h5562f2f06733e920
#13 0x0000aaaae98ee454 std::sys_common::backtrace::__rust_begin_short_backtrace::h7549166dff606a08
#14 0x0000aaaae99aba04 std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h29c85edad72c9531
#15 0x0000aaaae9964600 _$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hf5ee88030140dc2f
#16 0x0000aaaae99ac00c std::panicking::try::do_call::h286d7e4a15b0a0ed
#17 0x0000aaaae9a6b038 __rust_try
#18 0x0000aaaae99abe74 std::panicking::try::hc9e0e7a4417d6ead
#19 0x0000aaaae99a94c0 std::panic::catch_unwind::ha411c12cbbd43248
#20 0x0000aaaae99ab61c std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::hdf83f6ce30b202de
#21 0x0000aaaae99b2fb8 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h1dd7bfb988fbf8ab
#22 0x0000aaaae9ded7ac std::sys::unix::thread::Thread::new::thread_start::haf3b724064391ad1
#23 0x0000ffff8f0e43b4 start_thread
#24 0x0000ffff8f0198dc thread_start

Maybe this is related to something panicking, and thus preventing a clean exit (or any exit at all)? Though it looks to me like the panic-related frames are just part of std::thread::Builder::spawn_unchecked's normal machinery (it wraps the thread body in catch_unwind so a panic can be reported back to the joiner rather than handled by the caller directly). The more interesting part is that the lapin io_loop thread is parked in a blocking crossbeam recv, waiting on a channel that apparently never delivers.