smfrpc / smf

Fastest RPC in the west

Home Page: http://smfrpc.github.io/smf/


transient segfault observed

dotnwat opened this issue

I've seen this on Travis before. Will look at it in detail.

Last time it was that Travis ran out of disk space.

Thanks for reporting

I looked at the code:

seastar::future<> wal_partition_manager::open() {
  LOG_DEBUG("Opening partition manager with: opts={}, topic={}, partition={}",
            opts, topic, partition);
  seastar::sstring dir = topic + "." + seastar::to_sstring(partition);
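  // NOTE (per the discussion below): every shard runs this check-then-create,
  // so two shards can both see the directory as missing and both call
  // make_directory(); the then_wrapped() below deliberately swallows that failure.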
  return file_exists(dir)
    .then([dir](bool exists) {
      if (exists) { return seastar::make_ready_future<>(); }
      return seastar::make_directory(dir).then_wrapped(
        [](auto _) { return seastar::make_ready_future<>(); });
    })
    .then_wrapped([this](auto _) { return this->do_open(); });
}

Have you managed to repro it locally?

Hmm, come to think of it, you could create the dir multiple times, but that seems OK to me. In fact, looking at the code, I then_wrapped it explicitly, which means I'm explicitly ignoring this 'possible' failure.

Alternatively, and correctly, I should probably serialize this one step to core 0, since all the cores will use this as the beginning stage.
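A minimal sketch of that idea, not the actual patch: assuming seastar's smp::submit_to plus the file_exists / make_directory free functions used above, and with a made-up helper name, it would look roughly like this.

seastar::future<> ensure_root_dir(seastar::sstring dir) {
  // Run the racy check-then-create on shard 0 only, so the other shards wait
  // on its result instead of calling make_directory() concurrently.
  return seastar::smp::submit_to(0, [dir] {
    return seastar::file_exists(dir).then([dir](bool exists) {
      if (exists) { return seastar::make_ready_future<>(); }
      return seastar::make_directory(dir);
    });
  });
}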

All the other dir-making funcs are correct and have an lcore postfix, which ensures that only the one handling thread is opening or making files / directories.
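For reference, the per-core convention mentioned above is roughly this (illustrative only; the helper name is made up, and I'm assuming seastar::engine().cpu_id() for the shard id):

// Each shard appends its own lcore id to the path, so no two shards ever
// open or create the same file / directory.
seastar::sstring lcore_dir(const seastar::sstring &base) {
  return base + "." + seastar::to_sstring(seastar::engine().cpu_id());
}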

I haven't seen it locally, and even on Travis it only happened once out of a few runs. But that stack trace is totally obnoxious. Have you figured out how to get symbols in the seastar backtraces? I seem to recall not having them even when I was using the debug build.

Yeah, you have to use the seastar-addr2line program in the $SEASTAR/scripts/ folder.

How do you do it for zlog? Do you just run the docker .ci/local.sh script and then docker run -it bash on the build image? I notice in the scripts we do docker commit, which generates a hash, so you can attach a shell via docker run commit-id.


edited for clarity.

I'm a little confused by your question. Are you asking how to reproduce just one of the build configurations on Travis? What is hash<>?

You couldn't understand it because what I wrote was unintelligible, sorry.

What I meant to ask is: how do you debug crashes in zlog for non-native builds (builds that are not your development environment)? I.e., are you using docker to recreate the failures and then also debugging them inside docker, or are you using docker for cross builds but a virtual machine to debug? Just wondering.

I used addr2line with a debug build of master on my computer, and it's hard to tell exactly without having the same environment; here's what I got:

seastar::future<seastar::lw_shared_ptr<smf::wal_partition_manager> >::~future() at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:800                                                                                                                        
 (inlined by) _ZZN7seastar6futureIJNS_13lw_shared_ptrIN3smf21wal_partition_managerEEEEE12then_wrappedINS5_12finally_bodyIZNS2_16wal_impl_details22partition_manager_list11get_managerEjEUlvE0_Lb0EEES5_EET0_OT_ENUlSE_E_clINS_12future_stateIJS4_EEEEEDaSE_ at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:946
_ZN7seastar12continuationIZNS_6futureIJEE12then_wrappedIZNS_7shardedIN3smf21write_ahead_log_proxyEE5startIJNS5_8wal_optsEEEES2_DpOT_EUlS2_E0_S2_EET0_OT_EUlSG_E_JEEC2EOSH_ at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:394                             
_ZN7seastar12continuationIZNS_6futureIJEE12then_wrappedIZNS_7shardedIN3smf21write_ahead_log_proxyEE5startIJNS5_8wal_optsEEEES2_DpOT_EUlS2_E0_S2_EET0_OT_EUlSG_E_JEEC2EOSH_ at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:394                             
std::move_iterator<std::pair<fmt::BasicStringRef<char>, fmt::internal::Arg>*>::operator*() const at /usr/include/c++/7/bits/stl_iterator.h:1050                                                                                                                                           
seastar::future<>::state() at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:721                                                                                                                                                                             
 (inlined by) seastar::future<>::failed() at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:867                                                                                                                                                              
 (inlined by) seastar::future<>::~future() at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:799                                                                                                                                                             
 (inlined by) seastar::future<> seastar::future<>::then_wrapped<seastar::smp_message_queue::async_work_item<seastar::future<> seastar::sharded<smf::write_ahead_log_proxy>::invoke_on_all<>(seastar::future<> (smf::write_ahead_log_proxy::*)())::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>::process()::{lambda(auto:1&&)#1}, seastar::future<> >(seastar::smp_message_queue::async_work_item<seastar::future<> seastar::sharded<smf::write_ahead_log_proxy>::invoke_on_all<>(seastar::future<> (smf::write_ahead_log_proxy::*)())::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>)::{lambda(seastar::smp_message_queue::async_work_item<seastar::future<> seastar::sharded<smf::write_ahead_log_proxy>::invoke_on_all<>(seastar::future<> (smf::write_ahead_log_proxy::*)())::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>)#1}::operator()<seastar::future_state<> >(auto, seastar::smp_message_queue::async_work_item<seastar::future<> seastar::sharded<smf::write_ahead_log_proxy>::invoke_on_all<>(seastar::future<> (smf::write_ahead_log_proxy::*)())::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>) at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:946
seastar::future<>::state() at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:721                                                                                                                                                                             
 (inlined by) seastar::future<>::failed() at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:867                                                                                                                                                              
 (inlined by) seastar::future<>::~future() at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:799                                                                                                                                                             
 (inlined by) seastar::future<> seastar::future<>::then_wrapped<seastar::smp_message_queue::async_work_item<seastar::future<> seastar::sharded<smf::write_ahead_log_proxy>::invoke_on_all<>(seastar::future<> (smf::write_ahead_log_proxy::*)())::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>::process()::{lambda(auto:1&&)#1}, seastar::future<> >(seastar::smp_message_queue::async_work_item<seastar::future<> seastar::sharded<smf::write_ahead_log_proxy>::invoke_on_all<>(seastar::future<> (smf::write_ahead_log_proxy::*)())::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>)::{lambda(seastar::smp_message_queue::async_work_item<seastar::future<> seastar::sharded<smf::write_ahead_log_proxy>::invoke_on_all<>(seastar::future<> (smf::write_ahead_log_proxy::*)())::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>)#1}::operator()<seastar::future_state<> >(auto, seastar::smp_message_queue::async_work_item<seastar::future<> seastar::sharded<smf::write_ahead_log_proxy>::invoke_on_all<>(seastar::future<> (smf::write_ahead_log_proxy::*)())::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>) at /home/agallego/workspace/smf/build_debug/../meta/tmp/seastar/core/future.hh:946
?? ??:0                                                                                                                                                                                                                                                                                   
?? ??:0                                                                                                                                                                                                                                                                                   
?? ??:0                                                                                                                                                                                                                                                                                   
?? ??:0                                                                                                                                                                                                                                                                                   
?? ??:0 

I'm going to solve the contention issue as well.

So we already create a folder per lcore. What wasn't safe was the $ROOT dir of the write-ahead-log.

  seastar::sstring dir = topic + "." + seastar::to_sstring(partition);

We just had to serialize the creation of the root directory. I wouldn't be surprised if this was a filesystem bug.

Thanks for reporting.

Let me know if you have thoughts on the patch.

I always try to reproduce the bug locally in the same docker execution environment in which the bug occurred remotely on Travis, or in my normal dev environment if the bug looks simple enough. When you reproduce it locally in the docker container, you can jump in and get a gdb session open.

In zlog I invoke addr2line from a segfault handler (https://github.com/cruzdb/zlog/blob/4b77b43f776a47aab796ada48b4f9640498e8c1d/src/port/stack_trace.cc#L58) to get the symbols to print out correctly. That makes life way easier for getting an idea of what happened remotely.
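A minimal sketch of that pattern, not zlog's actual implementation, using glibc's backtrace_symbols_fd instead of shelling out to addr2line:

#include <csignal>
#include <execinfo.h>
#include <unistd.h>

// Sketch only: install a SIGSEGV handler that symbolizes the current stack
// before the process dies, so the CI log carries something readable.
static void segv_handler(int) {
  void *frames[64];
  int n = backtrace(frames, 64);
  // backtrace_symbols_fd() is async-signal-safe, unlike backtrace_symbols().
  backtrace_symbols_fd(frames, n, STDERR_FILENO);
  _exit(1);
}

int main() {
  std::signal(SIGSEGV, segv_handler);
  // ... rest of the program ...
}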

For seastar / smf debugging I guess we'll need to choose between a few options:

  1. It looks like we might just be able to pipe stderr output from test runs through seastar-addr2line, since it just echoes lines that don't contain symbols (https://github.com/scylladb/seastar/blob/e60afd24a1f120dab50959748fe58a8791ae4e0d/scripts/seastar-addr2line#L51). If that works, it's probably the easiest solution.

  2. We could dump a symbol table on VM startup.

  3. On failure we could upload the symbol table or a core image somewhere temporary, like an S3 object or something. There are a few options.

The reasoning behind that patch, for delegating mkdir to core 0, makes total sense. I'm still a total seastar newbie, but at a glance it seems good.

I haven't been able to reproduce the bug yet. But I found a couple of other enhancements for the write-ahead log.

I've been working on porting a modified version of the kernel's clock-pro.

I'll submit the patches in a couple of PRs.

Anyway, the race to create all directories should be fixed correctly soon.

Thanks for reporting.

@noahdesu I just merged the code that correctly handles ownership of creating directories.

I also really like your addr2line approach; maybe we can add it to the test suite like you said.

Given that the filesystem API has a new flow, I'm going to close this ticket.

Let's reopen it if this happens again.