Invalid partition being opened during log fetch for instant restart

caetanosauer opened this issue 8 years ago

[ This bug was reported via email by Kevin and Min from U Chicago ]

Bug encountered during instant restart after dirty shutdown. Commands executed:

./zapps kits -b tpcc -q 10 -t 8 --duration 60 --sm_bufpoolsize 15000 --sm_shutdown_clean false --load

./zapps kits -b tpcc -q 10 -t 8 --duration 60 --sm_bufpoolsize 15000 --sm_restart_instant true

Stack trace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff7ac973700 (LWP 30988)]
0x00007ffff7bc6414 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007ffff7bc6414 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000000000059d4aa in __gthread_mutex_lock (__mutex=0x30) at /usr/include/x86_64-linux-gnu/c++/4.8/bits/gthr-default.h:748
#2  lock (this=0x30) at /usr/include/c++/4.8/mutex:134
#3  lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/4.8/mutex:414
#4  partition_t::open_for_read (this=0x0) at /home/cc/zero/src/sm/partition.cpp:311
#5  0x0000000000582fa3 in log_core::fetch (this=0x7ffaec024a30, ll=..., buf=buf@entry=0x7ff38800ba60, nxt=nxt@entry=0x0, 
    forward=forward@entry=true) at /home/cc/zero/src/sm/log_core.cpp:236
#6  0x00000000005ee935 in restart_m::_collect_spr_logs (pid=@0x7ff7ac9714f0: 59349, current_lsn=..., emlsn=..., 
Python Exception <class 'IndexError'> list index out of range: 
    buffer=@0x7ff7ac971500: 0x7ff388000930 "\320\002\"\026\325", <incomplete sequence \347>, lr_offsets=std::list)
    at /home/cc/zero/src/sm/log_spr.cpp:121
#7  0x00000000005eef2b in restart_m::recover_single_page (p=..., emlsn=...) at /home/cc/zero/src/sm/log_spr.cpp:86
#8  0x00000000005bffed in vol_t::read_page_verify (this=0x7ff75f0d9720, pnum=pnum@entry=59349, buf=0x7ff3f7088000, emlsn=...)
    at /home/cc/zero/src/sm/vol.cpp:462
#9  0x00000000005162e7 in bf_tree_m::fix (this=0x7ff75dfad690, parent=parent@entry=0x0, page=@0x7ff7ac971898: 0x7ff3f7088000, pid=59349, 
    mode=mode@entry=LATCH_SH, conditional=false, virgin_page=false, only_if_hit=false, emlsn=...) at /home/cc/zero/src/sm/bf_tree.cpp:319
#10 0x0000000000516bb0 in bf_tree_m::fix_nonroot (this=<optimized out>, page=@0x7ff7ac971898: 0x7ff3f7088000, parent=parent@entry=0x0, 
    pid=<optimized out>, mode=mode@entry=LATCH_SH, conditional=conditional@entry=false, virgin_page=virgin_page@entry=false, 
    only_if_hit=only_if_hit@entry=false, emlsn=...) at /home/cc/zero/src/sm/bf_tree.cpp:805
#11 0x00000000005f420d in restart_m::redo_page_pass (this=0x7ffaec02e760) at /home/cc/zero/src/sm/restart.cpp:386
#12 0x00000000005f551c in restart_thread_t::run (this=<optimized out>) at /home/cc/zero/src/sm/restart.cpp:566
#13 0x00000000006132b4 in sthread_t::_start (this=0x7ff7504c2f10) at /home/cc/zero/src/common/sthread.cpp:732
#14 0x0000000000613558 in sthread_t::__start (arg=<optimized out>) at /home/cc/zero/src/common/sthread.cpp:658
#15 0x000000000061ea84 in pthread_core_start (_arg=<optimized out>) at /home/cc/zero/src/common/sthread_core_pthread.cpp:127
#16 0x00007ffff7bc4182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#17 0x00007ffff653a47d in clone () from /lib/x86_64-linux-gnu/libc.so.6
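
A note on reading the trace: frame #4 shows partition_t::open_for_read called with this=0x0, while frames #1 and #2 show the mutex at address 0x30. That pattern is what you get when a member mutex is locked through a null object pointer, since the member's address is then just its offset inside the object. The sketch below illustrates the arithmetic. FakePartition and member_mutex_address are hypothetical names, and the layout is invented to match the 0x30 offset; the real partition_t members are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical layout for illustration only; this is NOT the real partition_t.
struct FakePartition {
    char other_members[0x30];  // placeholder for whatever precedes the mutex
    std::mutex mutex;          // sits at offset 0x30 in this made-up layout
};

// Address a member mutex would have if the owning object lived at `base`.
// With base == 0 (a null `this`), the result is just the member offset.
std::size_t member_mutex_address(std::size_t base) {
    return base + offsetof(FakePartition, mutex);
}
```

Under this assumed layout, member_mutex_address(0) yields 0x30, matching the __mutex=0x30 argument in frame #1.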

Caetano Sauer · Answer 1 · Fri Jan 27 2017 01:25:04 GMT+0800 (China Standard Time)

Contents of the fetched LSN, sent by Min:

(gdb) f 5
#5  0x0000000000582fa3 in log_core::fetch (this=0x7ffaec024a30, ll=..., buf=buf@entry=0x7ff3a0001bc8, nxt=nxt@entry=0x0, forward=forward@entry=true) at /home/cc/zero/src/sm/log_core.cpp:236
236	    W_DO(p->open_for_read());
(gdb) show scheduler-locking
Mode for locking scheduler during execution is "on".
(gdb) p ll.str()
$1 = "2.298551816"
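
The printed LSN appears to have the form "partition.offset", so "2.298551816" would point into log partition 2 (the file log.2). As a small aid for reading such values, here is a sketch that splits the string into its two parts. ParsedLsn and parse_lsn are hypothetical helpers written for this thread, not part of the Zero code base, and the "partition.offset" interpretation is an assumption based on the output above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper type; not Zero's lsn_t.
struct ParsedLsn {
    std::uint32_t partition;  // log file number, e.g. 2 for log.2
    std::uint64_t offset;     // byte offset within that file
};

// Parse a non-empty run of decimal digits; returns false on anything else.
static bool parse_digits(const std::string& s, std::uint64_t& value) {
    if (s.empty()) return false;
    value = 0;
    for (char c : s) {
        if (c < '0' || c > '9') return false;
        value = value * 10 + static_cast<std::uint64_t>(c - '0');
    }
    return true;
}

// Split "partition.offset" at the first dot and parse both halves.
bool parse_lsn(const std::string& text, ParsedLsn& out) {
    const std::string::size_type dot = text.find('.');
    if (dot == std::string::npos) return false;
    std::uint64_t hi = 0, lo = 0;
    if (!parse_digits(text.substr(0, dot), hi)) return false;
    if (!parse_digits(text.substr(dot + 1), lo)) return false;
    out.partition = static_cast<std::uint32_t>(hi);
    out.offset = lo;
    return true;
}
```

For the value in the gdb session, this gives partition 2 and offset 298551816.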

Caetano Sauer · Answer 2 · Fri Jan 27 2017 01:34:36 GMT+0800 (China Standard Time)

From what I can gather so far, the p pointer is null, which should only happen when a given log partition is not found. However, your ls command shows that the file log.2 exists, so the get_partition() method should not return a null pointer.

I'm still trying to reproduce this. Without a reproduction, it will be very difficult to debug.
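
To make the failure mode concrete, here is a minimal sketch of the pattern described above: a lookup that returns null when a partition is not found, and a caller that checks the result before calling open_for_read(). All types here (Partition, LogStorage, FetchStatus) are hypothetical stand-ins that only borrow names from the stack trace; they are not the actual Zero classes or signatures. A check like this would not fix the underlying problem, it would only turn the segfault into a reportable error.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-ins; names mirror the thread, but this is not Zero's code.
enum class FetchStatus { Ok, PartitionNotFound };

class Partition {
public:
    void open_for_read() {
        std::lock_guard<std::mutex> guard(mutex_);  // crashes if `this` is null
        opened_ = true;
    }
    bool opened() const { return opened_; }
private:
    std::mutex mutex_;
    bool opened_ = false;
};

class LogStorage {
public:
    void add_partition(unsigned num) {
        partitions_[num] = std::make_shared<Partition>();
    }
    // Returns a null pointer when no partition with this number is known.
    std::shared_ptr<Partition> get_partition(unsigned num) const {
        auto it = partitions_.find(num);
        if (it == partitions_.end()) return std::shared_ptr<Partition>();
        return it->second;
    }
private:
    std::map<unsigned, std::shared_ptr<Partition>> partitions_;
};

// Check the lookup result instead of dereferencing it unconditionally.
FetchStatus fetch(const LogStorage& storage, unsigned partition_num) {
    std::shared_ptr<Partition> p = storage.get_partition(partition_num);
    if (!p) return FetchStatus::PartitionNotFound;
    p->open_for_read();
    return FetchStatus::Ok;
}
```

With only partition 2 registered, fetch(storage, 2) succeeds and fetch(storage, 3) reports PartitionNotFound instead of crashing.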

Caetano Sauer · Answer 3 · Fri Jan 27 2017 01:57:29 GMT+0800 (China Standard Time)

Hi guys, I just noticed one thing. You've got 24 log files generated for a workload of 60 seconds with SF 10. That seems like a lot to me. Are you running this with the log in main memory?

Kevin Hong · Answer 4 · Thu Feb 02 2017 00:40:09 GMT+0800 (China Standard Time)

Yes, the log is in main memory.