shadow / tgen

A powerful traffic generator that can model complex behaviors using Markov models and an action-dependency graph.

tgen-stream.c:1449:_tgenstream_onWritable: assertion failed: (_tgenstream_getTime(stream) >= stream->send.deferBarrierMicros)

sporksmith opened this issue:

I tripped this assertion in two tgen instances in a large-ish shadow simulation. They were both onion-service servers. Here's the failed simulation job: https://gitlab.torproject.org/jnewsome/sponsor-61-sims/-/jobs/149713

End of stdout:

2000-01-01 00:15:51 946685751.677612 [message] [tgen-stream.c:1633] [_tgenstream_log] [stream-success] transport [fd=494,local=localhost:127.0.0.1:80,proxy=NULL:0.0.0.0:0,remote=localhost:127.0.0.1:38376,state=SUCCESS_OPEN,error=NONE] stream [id=11462,vertexid=passive-stream:traffic,name=server77onionservice,peername=markovclient807onionservice,sendsize=0,recvsize=0,sendstate=SEND_SUCCESS,recvstate=RECV_SUCCESS,error=NONE] bytes [total-bytes-recv=8875,total-bytes-send=107632,payload-bytes-recv=8604,payload-bytes-send=107550,payload-progress-recv=100.00%,payload-progress-send=100.00%] times [created-ts=946685750153565,usecs-to-socket-create=0,usecs-to-socket-connect=0,usecs-to-proxy-init=-1,usecs-to-proxy-choice=-1,usecs-to-proxy-request=-1,usecs-to-proxy-response=-1,usecs-to-command=-1,usecs-to-response=792461,usecs-to-first-byte-recv=792461,usecs-to-last-byte-recv=1524047,usecs-to-checksum-recv=-1,usecs-to-first-byte-send=792461,usecs-to-last-byte-send=1473773,usecs-to-checksum-send=-1,now-ts=946685751677612]
2000-01-01 00:15:51 946685751.688462 [message] [tgen-stream.c:1633] [_tgenstream_log] [stream-success] transport [fd=2468,local=localhost:127.0.0.1:80,proxy=NULL:0.0.0.0:0,remote=localhost:127.0.0.1:41346,state=SUCCESS_OPEN,error=NONE] stream [id=11426,vertexid=passive-stream:traffic,name=server77onionservice,peername=markovclient835onionservice,sendsize=0,recvsize=0,sendstate=SEND_SUCCESS,recvstate=RECV_SUCCESS,error=NONE] bytes [total-bytes-recv=14610,total-bytes-send=631042,payload-bytes-recv=14340,payload-bytes-send=630960,payload-progress-recv=100.00%,payload-progress-send=100.00%] times [created-ts=946685748865026,usecs-to-socket-create=0,usecs-to-socket-connect=0,usecs-to-proxy-init=-1,usecs-to-proxy-choice=-1,usecs-to-proxy-request=-1,usecs-to-proxy-response=-1,usecs-to-command=-1,usecs-to-response=654000,usecs-to-first-byte-recv=1061436,usecs-to-last-byte-recv=2823436,usecs-to-checksum-recv=-1,usecs-to-first-byte-send=654000,usecs-to-last-byte-send=2807700,usecs-to-checksum-send=-1,now-ts=946685751688462]
Bail out! ERROR:/builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-stream.c:1449:_tgenstream_onWritable: assertion failed: (_tgenstream_getTime(stream) >= stream->send.deferBarrierMicros)

stderr:

ERROR:/builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-stream.c:1449:_tgenstream_onWritable: assertion failed: (_tgenstream_getTime(stream) >= stream->send.deferBarrierMicros)

Full stdout:
server77onionservice.tgen.1000.stdout.gz

I tried running shadow's minimal tor test with different seeds to see if I could get it to exercise this bug. No luck after 450 seeds. Some of them fail in the post-analysis step due to failed transfers, but none cause this tgen crash.

I'm trying another large run with both strace logging and tgen debug-level logging enabled: https://gitlab.torproject.org/jnewsome/sponsor-61-sims/-/pipelines/43500

strace-logging used all of our storage. Enabling debug-level logging on just the tgen servers also used all of our storage. I could try only enabling it on the onion service servers, or even just one of them, but I haven't yet.

I tried copying the core file and tgen binary off the server and inspecting them locally, but for some reason haven't been able to get that to work properly. gdb complains about some of the libraries not being present, which is expected, but it also seems to fail to symbolize the tgen stack frames.

I have been able to start a gdb session in a still-running simulation. I also built tgen with -O0.

Here's the stack:

#0  0x00007f039c5cf812 in shim_native_syscallv (n=n@entry=234, 
    args=args@entry=0x7f0391675570)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_syscall.c:94
#1  0x00007f039c5cffd4 in shim_syscallv (n=234, args=args@entry=0x7f0391675570)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_syscall.c:280
#2  0x00007f039c5d00d1 in shim_syscall (n=<optimized out>)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_syscall.c:289
#3  0x00007f039c5ce6f9 in _shim_seccomp_handle_sigsys (sig=<optimized out>, 
    info=<optimized out>, voidUcontext=0x7f0391675680)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_seccomp.c:63
#4  <signal handler called>
#5  0x00007f039c61cfe9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f039c5ceafd in _die_with_fatal_signal (signo=signo@entry=6)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_signals.c:26
#7  0x00007f039c5cef24 in shim_process_signals (host_lock=<optimized out>, 
    ucontext=ucontext@entry=0x0)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_signals.c:57
#8  0x00007f039c5cfcd7 in _shim_emulated_syscall_event (
    syscall_event=0x7f0391676e60)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_syscall.c:154
#9  shim_emulated_syscallv (n=n@entry=14, args=args@entry=0x7f0391677270)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_syscall.c:244
#10 0x00007f039c5d0007 in shim_syscallv (n=14, args=args@entry=0x7f0391677270)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_syscall.c:274
#11 0x00007f039c5d00d1 in shim_syscall (n=<optimized out>)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_syscall.c:289
#12 0x00007f039c5ce6f9 in _shim_seccomp_handle_sigsys (sig=<optimized out>, 
    info=<optimized out>, voidUcontext=0x7f0391677380)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/shadow/src/lib/shim/shim_seccomp.c:63
#13 <signal handler called>
#14 0x00007f039c61d00b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x00007f039c5fc859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x00007f039c7e9b43 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#17 0x00007f039c846cef in g_assertion_message_expr ()
   from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#18 0x000056039425ae28 in _tgenstream_onWritable (stream=0x5603964c7f50)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-stream.c:1447
#19 0x000056039425be06 in _tgenstream_runStreamEventLoop (
    stream=0x5603964c7f50, events=TGEN_EVENT_WRITE)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-stream.c:1802
#20 0x000056039425bffa in tgenstream_onEvent (stream=0x5603964c7f50, 
    descriptor=1703, events=TGEN_EVENT_WRITE)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-stream.c:1854
#21 0x000056039424e279 in _tgenio_helper (io=0x5603958e70f0, 
    child=0x560396f2ae80, in=0, out=1, done=0)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-io.c:324
#22 0x000056039424e5c0 in tgenio_loopOnce (io=0x5603958e70f0, maxEvents=100)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-io.c:398
#23 0x000056039424606c in tgendriver_activateIO (driver=0x5603958e7110)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-driver.c:601
#24 0x000056039424f366 in _tgenmain_run (argc=2, argv=0x7ffc8a0ac0d8)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-main.c:154
#25 0x000056039424f453 in main (argc=2, argv=0x7ffc8a0ac0d8)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-main.c:177

I was able to verify that currentEvents on the child object is 5 (EPOLLIN|EPOLLOUT), and that the epoll event that fired was 4 (EPOLLOUT). I believe this means we really were listening for the descriptor to become writable before the deferred-write timeout expired, which is incorrect.

(gdb) frame 22
#22 0x000056039424e5c0 in tgenio_loopOnce (io=0x5603958e70f0, maxEvents=100)
    at /builds/jnewsome/sponsor-61-sims/jobs/src/tgen/src/tgen-io.c:398
398                 _tgenio_helper(io, child, in, out, done);
(gdb) print epevs[i]
$1 = {events = 4, data = {ptr = 0x6a7, fd = 1703, u32 = 1703, u64 = 1703}}
(gdb) print *child
$2 = {descriptor = 1703, currentEvents = 5, deferWriteTimer = 0x560396e399f0, 
  refcount = 2, magic = 2882395322, 
  notify = 0x56039425bf4d <tgenstream_onEvent>, 
  checkTimeout = 0x56039425c16e <tgenstream_onCheckTimeout>, 
  data = 0x5603964c7f50, destructData = 0x56039425cc03 <tgenstream_unref>, 
  io = 0x5603958e70f0}
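
For anyone reading the raw values above, here is a minimal sketch of how the masks decode on Linux. The helper name `describe_epoll_events` is hypothetical (it is not part of tgen or shadow); it only relies on the standard `EPOLL*` constants from `<sys/epoll.h>`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical helper (not from tgen): writes a '|'-separated list of the
 * common epoll flags set in `events` into `buf`, and returns `buf`. */
static const char *describe_epoll_events(uint32_t events, char *buf, size_t len) {
    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    const struct { uint32_t bit; const char *name; } known[] = {
        {EPOLLIN, "EPOLLIN"},
        {EPOLLOUT, "EPOLLOUT"},
        {EPOLLERR, "EPOLLERR"},
        {EPOLLHUP, "EPOLLHUP"},
    };

    for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
        if (events & known[i].bit) {
            /* Add a separator before every name except the first. */
            if (buf[0] != '\0') {
                strncat(buf, "|", len - strlen(buf) - 1);
            }
            strncat(buf, known[i].name, len - strlen(buf) - 1);
        }
    }
    return buf;
}
```

Calling it with 5 (the `currentEvents` value) and 4 (the `epevs[i].events` value) gives the flag names for the registered interest set and for the event that actually fired, respectively.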

I think the next step is to add some debugging to the timerfd callback, to verify that it's getting called prematurely and figure out why...
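
A rough sketch of the kind of check I have in mind is below. It is not tgen's actual code: `now_micros` and `check_timer_not_early` are hypothetical names, and using `CLOCK_MONOTONIC` is an assumption about which clock the barrier is measured against. The idea is just to log how early the timer fired instead of aborting.

```c
#define _POSIX_C_SOURCE 200809L
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Current monotonic time in microseconds (assumed clock; tgen may differ). */
static int64_t now_micros(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Hypothetical debugging check for a timer callback.
 * Returns 0 if `now` has reached `barrier`; otherwise logs the discrepancy
 * and returns how many microseconds early the timer fired. */
static int64_t check_timer_not_early(int64_t now, int64_t barrier, int fd) {
    if (now >= barrier) {
        return 0;
    }
    int64_t early = barrier - now;
    fprintf(stderr,
            "timer for fd %d fired %" PRId64 " us early (now=%" PRId64
            ", barrier=%" PRId64 ")\n",
            fd, early, now, barrier);
    return early;
}
```

In a timer callback this would be called as `check_timer_not_early(now_micros(), barrier, fd)`, with `barrier` taken from wherever the deferral deadline is stored.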

I stumbled across the bug now fixed in shadow/shadow#2279 while debugging this. I thought it might be the root cause, but it doesn't appear to be. Still debugging...

Confirmed this is fixed by shadow/shadow#2279 and shadow/shadow#2282 (i.e., this was never a tgen bug, but a shadow bug).