lvmteam / lvm2

Mirror of upstream LVM2 repository

Home Page: https://gitlab.com/lvmteam/lvm2

Deadlock when issuing concurrent vgcreate/vgremove

gpaul opened this issue

Version information

URL: https://www.sourceware.org/pub/lvm2/LVM2.2.02.183.tgz
sha1sum: c73173d73e2ca17da254883968fbd52a6ce5c2a6

Build steps

export PKG_PATH=/opt/lvm/

./configure --with-confdir=$PKG_PATH/etc --with-default-system-dir=$PKG_PATH/etc/lvm --prefix=$PKG_PATH --sbindir=$PKG_PATH/bin --with-usrsbindir=$PKG_PATH/bin --enable-static_link
make
make install
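
For completeness, a quick sanity check that the freshly built binaries (rather than any distro-packaged lvm2) are the ones being exercised; the paths simply follow the --prefix/--sbindir values used above:

export PATH=/opt/lvm/bin:$PATH
command -v vgcreate vgremove   # should resolve to /opt/lvm/bin/...
vgcreate --version             # should report 2.02.183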

What were you trying to do?

I remove a volume group using vgremove while creating a different volume group with different PVs using vgcreate.

What happened?

Both commands hang. It looks like vgcreate is trying to acquire a lock that vgremove already holds.

Steps to reproduce

$ mkdir -p /var/lib/gpaul
$ dd if=/dev/zero of=/var/lib/gpaul/disk1 count=1024 bs=1M
$ dd if=/dev/zero of=/var/lib/gpaul/disk2 count=1024 bs=1M
$ dd if=/dev/zero of=/var/lib/gpaul/disk3 count=1024 bs=1M

$ losetup -f /var/lib/gpaul/disk1
$ losetup -f /var/lib/gpaul/disk2
$ losetup -f /var/lib/gpaul/disk3

$ losetup -a
/dev/loop0: [51713]:41951014 (/var/lib/gpaul/disk1)
/dev/loop1: [51713]:41951015 (/var/lib/gpaul/disk2)
/dev/loop2: [51713]:41951016 (/var/lib/gpaul/disk3)

$ pvcreate /dev/loop0
$ pvcreate /dev/loop1
$ pvcreate /dev/loop2

$ vgcreate gpaul-vg-1 /dev/loop0
$ vgremove --config="log {level=7 verbose=1}" gpaul-vg-1 & vgcreate --config="log {level=7 verbose=1}" gpaul-vg-2 /dev/loop1 /dev/loop2
[1] 22111
    Logging initialised at Thu Aug  8 12:24:47 2019
    Logging initialised at Thu Aug  8 12:24:47 2019
    Archiving volume group "gpaul-vg-1" metadata (seqno 1).
^C  Interrupted...
  Giving up waiting for lock.
  Can't get lock for gpaul-vg-1
  Cannot process volume group gpaul-vg-1
  Interrupted...
  Interrupted...
  Device /dev/loop1 excluded by a filter.
  Device /dev/loop2 excluded by a filter.
    Removing physical volume "/dev/loop0" from volume group "gpaul-vg-1"
  Volume group "gpaul-vg-1" successfully removed
    Reloading config files
$     Reloading config files

[1]+  Done                    vgremove --config="log {level=7 verbose=1}" gpaul-vg-1
$ date
Thu Aug  8 12:25:01 UTC 2019

Note: in the interleaved logging below, process 22112 is vgcreate and process 22111 is vgremove.

I'm attaching the interleaved verbose debug logs for both processes as sent to journald.
lvm-deadlock.log

Also of note: the lvm.conf I use differs from the one bundled with the RHEL lvm2 RPMs; it is a stock lvm.conf as generated with the ./configure parameters above. I've attached it anyway.
lvm.conf.txt
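
For reference, the behaviour in play here is the file-based flock locking configured in the global section of lvm.conf (locking_type, locking_dir and wait_for_locks in the 2.02 branch). Assuming the --with-usrsbindir and --with-default-system-dir paths from the build steps above, the effective values can be dumped with:

/opt/lvm/bin/lvmconfig global/locking_type global/locking_dir global/wait_for_locks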

@tasleson indeed, that looks related. It seems there are still some deadlock issues lurking about.

This looks identical, actually:

...
Aug 08 12:24:47 lvm[22111]: Dropping cache for #orphans.
Aug 08 12:24:47 lvm[22111]: Locking /run/lock/lvm/P_orphans WB
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/P_orphans:aux WB
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/P_orphans WB
...

Here the first process (22111, vgremove) is acquiring /run/lock/lvm/P_orphans.

...
Aug 08 12:24:47 lvm[22112]: Locking /run/lock/lvm/V_gpaul-vg-1 RB
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-1:aux WB
Aug 08 12:24:47 lvm[22112]: _undo_flock /run/lock/lvm/V_gpaul-vg-1:aux
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-1 RB
Aug 08 12:24:57 lvm[22112]: Interrupted...
...

...and the second process (22112, vgcreate) tries to acquire the volume group lock, where it blocks until interrupted.
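
If I read the flags correctly (an assumption about the debug output, not verified against the source), WB is a blocking exclusive (write) flock request and RB a blocking shared (read) request, so even a read request queues behind an already-held write lock. A throwaway flock(1) sketch of that behaviour, using a placeholder file in /tmp rather than LVM's real lock files:

# Hypothetical demo: a shared (read) request still waits behind a held exclusive lock.
touch /tmp/demo-lock
( exec 9>/tmp/demo-lock; flock -x 9; sleep 5 ) &    # holder: exclusive lock for 5s
sleep 0.2
( exec 9>/tmp/demo-lock; flock -s -w 2 9 \
    || echo "shared request timed out behind the exclusive holder" )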

If we look at _do_flock and _undo_flock calls only:

Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-2:aux WB
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-2 WB
Aug 08 12:24:47 lvm[22112]: _undo_flock /run/lock/lvm/V_gpaul-vg-2:aux
Aug 08 12:24:47 lvm[22112]: _undo_flock /run/lock/lvm/V_gpaul-vg-2
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/P_orphans:aux WB
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/P_orphans WB
Aug 08 12:24:47 lvm[22112]: _undo_flock /run/lock/lvm/P_orphans:aux
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/V_gpaul-vg-1:aux WB
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/V_gpaul-vg-1 WB
Aug 08 12:24:47 lvm[22111]: _undo_flock /run/lock/lvm/V_gpaul-vg-1:aux
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/P_orphans:aux WB
Aug 08 12:24:47 lvm[22111]: _do_flock /run/lock/lvm/P_orphans WB
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-1:aux WB
Aug 08 12:24:47 lvm[22112]: _undo_flock /run/lock/lvm/V_gpaul-vg-1:aux
Aug 08 12:24:47 lvm[22112]: _do_flock /run/lock/lvm/V_gpaul-vg-1 RB
... deadlocked, eventually interrupted with ctrl+c ...
Aug 08 12:24:57 lvm[22112]: _undo_flock /run/lock/lvm/P_orphans
Aug 08 12:24:57 lvm[22111]: _undo_flock /run/lock/lvm/P_orphans:aux
Aug 08 12:24:57 lvm[22111]: _undo_flock /run/lock/lvm/P_orphans
Aug 08 12:24:57 lvm[22111]: _undo_flock /run/lock/lvm/V_gpaul-vg-1

Yeah,
It looks like 22112 (vgcreate) acquires the P_orphans lock, then the V_gpaul-vg-1 lock.
It looks like 22111 (vgremove) acquires the V_gpaul-vg-1 lock, then the P_orphans lock.
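To make that inversion concrete, here's a minimal stand-alone sketch with flock(1), using placeholder files in /tmp in place of P_orphans and V_gpaul-vg-1 (the same AB/BA pattern, not LVM's actual code path):

touch /tmp/P_orphans /tmp/V_demo

# "vgcreate"-like ordering: P_orphans first, then the VG lock
( exec 7>/tmp/P_orphans 8>/tmp/V_demo
  flock -x 7; sleep 1                  # hold P_orphans while the other side grabs the VG lock
  flock -w 5 -x 8 || echo "vgcreate-like: stuck waiting for the VG lock" ) &

# "vgremove"-like ordering: the VG lock first, then P_orphans
( exec 7>/tmp/P_orphans 8>/tmp/V_demo
  flock -x 8; sleep 1                  # hold the VG lock while the other side grabs P_orphans
  flock -w 5 -x 7 || echo "vgremove-like: stuck waiting for P_orphans" ) &

wait   # both -w 5 timeouts fire: each process is waiting on the lock the other holds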

This is a design problem in lvm locking, which uses two "global" locks, uses them inconsistently, and in the wrong places. It is fixed in the lvm 2.03 versions by:
https://sourceware.org/git/?p=lvm2.git;a=commit;h=8c87dda195ffadcce1e428d3481e8d01080e2b22