Deadlock when issuing concurrent vgcreate/vgremove
gpaul opened this issue · comments
Version information
URL: https://www.sourceware.org/pub/lvm2/LVM2.2.02.183.tgz
sha1sum: c73173d73e2ca17da254883968fbd52a6ce5c2a6
Build steps
export PKG_PATH=/opt/lvm/
./configure --with-confdir=$PKG_PATH/etc --with-default-system-dir=$PKG_PATH/etc/lvm --prefix=$PKG_PATH --sbindir=$PKG_PATH/bin --with-usrsbindir=$PKG_PATH/bin --enable-static_link
make
make install
What were you trying to do?
I removed a volume group using vgremove while concurrently creating a different volume group, backed by different PVs, using vgcreate.
What happened?
Both commands hang. It looks like vgcreate tries to acquire a lock while vgremove holds it.
Steps to reproduce
$ mkdir -p /var/lib/gpaul
$ dd if=/dev/zero of=/var/lib/gpaul/disk1 count=1024 bs=1M
$ dd if=/dev/zero of=/var/lib/gpaul/disk2 count=1024 bs=1M
$ dd if=/dev/zero of=/var/lib/gpaul/disk3 count=1024 bs=1M
$ losetup -f /var/lib/gpaul/disk1
$ losetup -f /var/lib/gpaul/disk2
$ losetup -f /var/lib/gpaul/disk3
$ losetup -a
/dev/loop0: [51713]:41951014 (/var/lib/gpaul/disk1)
/dev/loop1: [51713]:41951015 (/var/lib/gpaul/disk2)
/dev/loop2: [51713]:41951016 (/var/lib/gpaul/disk3)
$ pvcreate /dev/loop0
$ pvcreate /dev/loop1
$ pvcreate /dev/loop2
$ vgcreate gpaul-vg-1 /dev/loop0
$ vgremove --config="log {level=7 verbose=1}" gpaul-vg-1 & vgcreate --config="log {level=7 verbose=1}" gpaul-vg-2 /dev/loop1 /dev/loop2
[1] 22111
Logging initialised at Thu Aug 8 12:24:47 2019
Logging initialised at Thu Aug 8 12:24:47 2019
Archiving volume group "gpaul-vg-1" metadata (seqno 1).
^C Interrupted...
Giving up waiting for lock.
Can't get lock for gpaul-vg-1
Cannot process volume group gpaul-vg-1
Interrupted...
Interrupted...
Device /dev/loop1 excluded by a filter.
Device /dev/loop2 excluded by a filter.
Removing physical volume "/dev/loop0" from volume group "gpaul-vg-1"
Volume group "gpaul-vg-1" successfully removed
Reloading config files
$ Reloading config files
[1]+ Done vgremove --config="log {level=7 verbose=1}" gpaul-vg-1
$ date
Thu Aug 8 12:25:01 UTC 2019
Note: in the following interleaved logging, process 22112 is vgcreate and process 22111 is vgremove.
I'm attaching the interleaved, verbose, debug logs for the processes as sent to journald.
lvm-deadlock.log
Also of note: the lvm.conf I use is different from the one bundled with the RHEL lvm2 RPMs; it is a very standard lvm.conf as generated by the ./configure parameters above. I've attached it anyway.
lvm.conf.txt
@tasleson indeed, that looks related. It seems there are still some deadlock issues lurking about.
This looks identical, actually:
...
Aug 08 12:24:47 lvm[22111]: Dropping cache for #orphans.
Aug 08 12:24:47 lvm[22111]: Locking /run/lock/lvm/P_orphans WB
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/P_orphans:aux WB
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/P_orphans WB
...
Here the first process, 22111 (vgremove), is acquiring /run/lock/lvm/P_orphans.
...
Aug 08 12:24:47 lvm[22112]: Locking /run/lock/lvm/V_gpaul-vg-1 RB
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-1:aux WB
Aug 08 12:24:47 lvm[22112]: _undo_flock /run/lock/lvm/V_gpaul-vg-1:aux
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-1 RB
Aug 08 12:24:57 lvm[22112]: Interrupted...
...
...and the second process, 22112 (vgcreate), tries to acquire the volume group lock.
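For context on these log lines: _do_flock ultimately takes a flock(2) lock on the lock file (the "WB" and "RB" suffixes appear to mean write/exclusive and read/shared, both blocking). A minimal Python sketch of the underlying behavior, using a temporary stand-in for the /run/lock/lvm files (path and names are illustrative, not LVM code):

```python
import fcntl
import os
import tempfile

# Stand-in for a lock file like /run/lock/lvm/V_gpaul-vg-1.
path = os.path.join(tempfile.mkdtemp(), "V_demo")

# Two separate opens = two open file descriptions, so their flocks
# conflict with each other, just like two separate lvm processes would.
fd1 = os.open(path, os.O_CREAT | os.O_RDWR)
fd2 = os.open(path, os.O_RDWR)

# First "process" takes an exclusive (write) lock -- the "WB" case.
fcntl.flock(fd1, fcntl.LOCK_EX)

# Second "process" probes with LOCK_NB so we can observe the contention
# without actually blocking the way the real commands do.
try:
    fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    contended = False
except BlockingIOError:
    contended = True

print(contended)  # True: the first lock is still held

fcntl.flock(fd1, fcntl.LOCK_UN)  # the "_undo_flock" step: release the lock
os.close(fd1)
os.close(fd2)
```

In the real deadlock neither side uses LOCK_NB, so each command simply blocks inside flock(2) until interrupted.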
If we look at the _do_flock and _undo_flock calls only:
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-2:aux WB
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-2 WB
Aug 08 12:24:47 lvm[22112]: _undo_flock /run/lock/lvm/V_gpaul-vg-2:aux
Aug 08 12:24:47 lvm[22112]: _undo_flock /run/lock/lvm/V_gpaul-vg-2
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/P_orphans:aux WB
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/P_orphans WB
Aug 08 12:24:47 lvm[22112]: _undo_flock /run/lock/lvm/P_orphans:aux
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/V_gpaul-vg-1:aux WB
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/V_gpaul-vg-1 WB
Aug 08 12:24:47 lvm[22111]: _undo_flock /run/lock/lvm/V_gpaul-vg-1:aux
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/P_orphans:aux WB
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/P_orphans WB
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-1:aux WB
Aug 08 12:24:47 lvm[22112]: _undo_flock /run/lock/lvm/V_gpaul-vg-1:aux
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-1 RB
... deadlocked, eventually interrupted with ctrl+c ...
Aug 08 12:24:57 lvm[22112]: _undo_flock /run/lock/lvm/P_orphans
Aug 08 12:24:57 lvm[22111]: _undo_flock /run/lock/lvm/P_orphans:aux
Aug 08 12:24:57 lvm[22111]: _undo_flock /run/lock/lvm/P_orphans
Aug 08 12:24:57 lvm[22111]: _undo_flock /run/lock/lvm/V_gpaul-vg-1
Yeah, it looks like 22112 (vgcreate) acquires the P_orphans lock, then the V_gpaul-vg-1 lock, while 22111 (vgremove) acquires the V_gpaul-vg-1 lock, then the P_orphans lock.
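That is the classic ABBA deadlock: each process holds one lock and blocks waiting for the other. The standard fix is to impose a single canonical acquisition order on all lock takers. A minimal illustration, with plain Python threads standing in for the two commands (this is not LVM code; names are borrowed from the logs above):

```python
import threading

# Stand-ins for LVM's /run/lock/lvm/* file locks (illustrative only).
locks = {
    "P_orphans": threading.Lock(),
    "V_gpaul-vg-1": threading.Lock(),
}

results = []

def run_command(name, needed):
    # Fix for ABBA deadlocks: every "command" acquires its locks in one
    # canonical order (here: sorted by name), whatever order it asked in.
    ordered = sorted(needed)
    for lock_name in ordered:
        locks[lock_name].acquire()
    try:
        results.append(name)  # the command's critical section
    finally:
        for lock_name in reversed(ordered):
            locks[lock_name].release()

# The two workers request the locks in opposite orders, like the real
# commands did -- the sorted acquisition makes this safe.
t1 = threading.Thread(target=run_command,
                      args=("vgcreate", ["P_orphans", "V_gpaul-vg-1"]))
t2 = threading.Thread(target=run_command,
                      args=("vgremove", ["V_gpaul-vg-1", "P_orphans"]))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(results))  # ['vgcreate', 'vgremove']: both commands complete
```

Without the sort, the two threads could each grab their first lock and then block on the other's, exactly as in the trace above.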
This is a design problem in lvm locking, which uses two "global" locks, uses them inconsistently, and in the wrong places. It is fixed in the lvm 2.03 versions by:
https://sourceware.org/git/?p=lvm2.git;a=commit;h=8c87dda195ffadcce1e428d3481e8d01080e2b22
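Independent of the upstream fix, a cheap way to catch inversions like this during development is a lock wrapper that asserts a global rank order and fails fast instead of silently deadlocking. A hypothetical sketch (the ranks and canonical order here are illustrative, not what LVM actually uses):

```python
import threading

_held = threading.local()  # per-thread stack of currently held locks

class OrderedLock:
    """Lock that enforces a global acquisition order by rank,
    raising immediately on an inversion instead of deadlocking."""
    def __init__(self, name, rank):
        self.name, self.rank = name, rank
        self._lock = threading.Lock()

    def acquire(self):
        stack = getattr(_held, "stack", [])
        if stack and stack[-1].rank >= self.rank:
            raise RuntimeError("lock order violation: %s after %s"
                               % (self.name, stack[-1].name))
        self._lock.acquire()
        stack.append(self)
        _held.stack = stack

    def release(self):
        _held.stack.pop()
        self._lock.release()

# Hypothetical canonical order: P_orphans always before any VG lock.
P_orphans = OrderedLock("P_orphans", 1)
V_vg1 = OrderedLock("V_gpaul-vg-1", 2)

# vgcreate's order (ranks ascending): fine.
P_orphans.acquire(); V_vg1.acquire()
V_vg1.release(); P_orphans.release()

# vgremove's order (VG lock first, then P_orphans): flagged at once.
violation = None
V_vg1.acquire()
try:
    P_orphans.acquire()
except RuntimeError as e:
    violation = str(e)
V_vg1.release()
print(violation)  # lock order violation: P_orphans after V_gpaul-vg-1
```

The same idea is what tools like lockdep do in the kernel: turn an eventual deadlock into an immediate, debuggable error.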