cv_signal(&arc_abd_move_thr_cv) being called after cv_destroy(&arc_abd_move_thr_cv);
arun-kv opened this issue
In arc_fini we signal the arc_abd_move_thread to exit first, and then destroy arc_abd_move_thr_cv.
https://github.com/openzfsonwindows/ZFSin/blob/master/ZFSin/zfs/module/zfs/arc.c#L7837
We then signal arc_reclaim_thread, which in turn tries to signal arc_abd_move_thr_cv, which has already been destroyed.
https://github.com/openzfsonwindows/ZFSin/blob/master/ZFSin/zfs/module/zfs/arc.c#L5183
This leads to an occasional panic during uninstallation.
00 ffffd901`0d02f978 fffff800`ecf75f07 nt!KeBugCheckEx
01 ffffd901`0d02f980 fffff800`ecf54dcf nt!PspSystemThreadStartup$filt$0+0x44
02 ffffd901`0d02f9c0 fffff800`ecf6bb8d nt!_C_specific_handler+0x9f
03 ffffd901`0d02fa30 fffff800`ecebde91 nt!RtlpExecuteHandlerForException+0xd
04 ffffd901`0d02fa60 fffff800`ecebcc07 nt!RtlDispatchException+0x421
05 ffffd901`0d030160 fffff800`ecf70a0e nt!KiDispatchException+0x1d7
06 ffffd901`0d030820 fffff800`ecf6e073 nt!KiExceptionDispatch+0xce
07 ffffd901`0d030a00 fffff807`d842222e nt!KiBreakpointTrap+0xf3
08 ffffd901`0d030b90 fffff807`d847f082 ZFSin!spl_cv_signal+0xe [C:\BuildAgent\work\88cd52027cd63d70\ZFSin\spl\module\spl\spl-condvar.c @ 72]
09 ffffd901`0d030bc0 fffff800`ececb595 ZFSin!arc_reclaim_thread+0x422 [C:\BuildAgent\work\88cd52027cd63d70\ZFSin\zfs\module\zfs\arc.c @ 5185]
0a ffffd901`0d030c10 fffff800`ecf6ac56 nt!PspSystemThreadStartup+0x41
0b ffffd901`0d030c60 00000000`00000000 nt!KiStartSystemThread+0x16
@lundman We may have to consider splitting arc_abd_move_thr_fini()
https://github.com/openzfsonwindows/ZFSin/blob/master/ZFSin/zfs/module/zfs/arc.c#L9412
into two, where we do
mutex_enter(&arc_abd_move_thr_lock);
cv_signal(&arc_abd_move_thr_cv);
arc_abd_move_thr_exit = 1;
while (arc_abd_move_thr_exit != 0)
    cv_wait(&arc_abd_move_thr_cv, &arc_abd_move_thr_lock);
mutex_exit(&arc_abd_move_thr_lock);
in the first and
mutex_destroy(&arc_abd_move_thr_lock);
cv_destroy(&arc_abd_move_thr_cv);
in the second.
The second part should be delayed till the "end" of driver unload. This allows other threads to inspect arc_abd_move_thr_exit and lock/unlock the synchronization primitive protecting it.
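For illustration, the split could look roughly like the sketch below. It is a user-space analogue built on POSIX threads, not the SPL cv_*/mutex_* API, and every name in it (move_thr_init, move_thr_stop, move_thr_poke, move_thr_destroy, move_worker) is hypothetical rather than taken from arc.c. It only shows the ordering: stop the worker first, keep the lock and condition variable alive while other threads may still signal them, and destroy them last.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for arc_abd_move_thr_lock, _cv and _exit. */
static pthread_mutex_t move_lock;
static pthread_cond_t move_cv;
static int move_exit;		/* 1 = exit requested, 0 = worker has exited */
static pthread_t move_thread;

/* Worker: sleeps until asked to exit, then acknowledges by clearing the flag. */
static void *
move_worker(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&move_lock);
	while (move_exit == 0)
		pthread_cond_wait(&move_cv, &move_lock);
	move_exit = 0;
	pthread_cond_broadcast(&move_cv);
	pthread_mutex_unlock(&move_lock);
	return (NULL);
}

void
move_thr_init(void)
{
	int err;

	pthread_mutex_init(&move_lock, NULL);
	pthread_cond_init(&move_cv, NULL);
	move_exit = 0;
	err = pthread_create(&move_thread, NULL, move_worker, NULL);
	assert(err == 0);
}

/* Phase 1: ask the worker to exit and wait for it; nothing is destroyed. */
void
move_thr_stop(void)
{
	pthread_mutex_lock(&move_lock);
	move_exit = 1;
	pthread_cond_broadcast(&move_cv);
	while (move_exit != 0)
		pthread_cond_wait(&move_cv, &move_lock);
	pthread_mutex_unlock(&move_lock);
	pthread_join(move_thread, NULL);
}

/* A late signal from another thread (the reclaim thread in the report). */
int
move_thr_poke(void)
{
	int err;

	pthread_mutex_lock(&move_lock);
	err = pthread_cond_signal(&move_cv);
	pthread_mutex_unlock(&move_lock);
	return (err);
}

/* Phase 2: run at the very end of unload, once no thread can signal anymore. */
void
move_thr_destroy(void)
{
	pthread_cond_destroy(&move_cv);
	pthread_mutex_destroy(&move_lock);
}
```

In this sketch a teardown path would call move_thr_stop(), then shut down any other threads that might still call move_thr_poke(), and only afterwards call move_thr_destroy(). A signal that arrives between the two phases finds no waiter and does nothing, instead of touching a destroyed object.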
To be honest, the abd_move work came from osx, where I already removed it when merging with the new port. It was decided that if the need comes up again, we'll re-implement it, since the way abd is set up is a little different. At this point, I'm inclined to remove it from ZFSin as well.
Nice catch though!
Thanks @lundman. We will take this change in our environment and see how it goes.