linux-nvme / nvme-cli

NVMe management command line interface.

Home Page: https://nvmexpress.org

Fabrics connect for large number of connections too slow with v2.8

sukhi61166 opened this issue

Fabrics connect over TCP/RoCEv2 for 64 admin connections with 7 I/O connections each is excessively slow compared to nvme-cli v1.16.

Connections were created using nvme connect once for each admin connection. Unique host-id, host-nqn and target-nqn values were used for each subsequent admin connect. As an example, two of the several commands are listed below:
for i in {1..8}
do
sudo nvme connect -t rdma -s 4420 -a $tgt1_ip -i 7 -n nqn.org:nvme.1 -I $host_id$i -q $host_nqn$i
sudo nvme connect -t rdma -s 4420 -a $tgt1_ip -i 7 -n nqn.org:nvme.2 -I $host_id$i -q $host_nqn$i
.....
done

Connect timings:
v2.8 : 51m 7.214s
v2.2.1 : 21m 59.016s
v1.16 : 1m 51.517s

As can be seen from the data collected on the same machine with various nvme-cli versions, there is a significant downward trend in connect performance between v1.16 and v2.8.

Is this with the same kernel?

Between the 1.x and the 2.x branches, there is a significant change in how the topology is discovered. The 2.x code crawls through sysfs to figure out the topology. But even that should be in the range of a few seconds at most.

v2.2.1 still issues a few commands to figure out the topology, while v2.8 only uses sysfs for discovery.
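
For reference, a rough sketch of what that sysfs crawl reads; the controller and subsystem names (nvme0, nvme-subsys0) are purely examples, but the attribute files are the same ones that show up in the strace output later in this thread:

ls /sys/class/nvme-subsystem/                # one nvme-subsysN directory per subsystem
ls /sys/class/nvme-subsystem/nvme-subsys0/   # controllers and namespaces grouped under it
cat /sys/class/nvme/nvme0/transport          # per-controller attributes read during the scan,
cat /sys/class/nvme/nvme0/address            # e.g. transport, address, state, model, serial
cat /sys/class/nvme/nvme0/state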

Given that, it doesn't look like the large delay is introduced by the discovery change.

Could you paste the output of one connect command, but this time with verbose level enabled, e.g.

sudo nvme connect -t rdma -s 4420 -a $tgt1_ip -i 7 -n nqn.org:nvme.1 -I $host_id$i -q $host_nqn$i -vvv

Yes, all experiments were done on the same OS with the same kernel, as listed below:
Red Hat Enterprise Linux 9.2 (Plow)
5.14.0-284.30.1.el9_2.x86_64

Below is the output of the connect command in verbose mode with nvme-cli v2.8. Verbose mode is not supported in v1.16, so I could not get that output.

sudo ./nvme connect -t rdma -s 4420 -a 10.10.10.10 -i 7 -n nqn..org:nvme.1 -I f0a54208-c6fd-4f15-bee0-92a790b7d192 -q nqn.2014-08.org.nvmexpress:uuid:a9ff9600-3ac5-11eb-8000-3cecef1fc322 -vvv
warning: use hostid which does not match uuid in hostnqn
kernel supports: instance cntlid transport traddr trsvcid nqn queue_size nr_io_queues reconnect_delay ctrl_loss_tmo keep_alive_tmo hostnqn host_traddr host_iface hostid duplicate_connect disable_sqflow hdr_digest data_digest nr_write_queues nr_poll_queues tos fast_io_fail_tmo discovery dhchap_secret dhchap_ctrl_secret
option "keyring" ignored
connect ctrl, 'nqn=nqn.org:nvme.1,transport=rdma,traddr=10.10.10.10,trsvcid=4420,hostnqn=nqn.2014-08.org.nvmexpress:uuid:a9ff9600-3ac5-11eb-8000-3cecef1fc322,hostid=f0a54208-c6fd-4f15-bee0-92a790b7d192,nr_io_queues=7,ctrl_loss_tmo=600'
connect ctrl, response 'instance=0,cntlid=1'
nvme0: nqn.org:nvme.1 connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme0
connecting to device: nvme0

The log shows that we issue just one connect command from userland to the kernel, exactly as it should be.
Does the kernel log anything?

Anyway, I'll try to reproduce this locally with nvme-tcp. I don't think the transport really matters. Unless @maurizio-lombardi has a test system up and running to check if there is something funny going on in the RH kernel.

@igaw I will have to ask our QA team if they have a system available to test this. However, the RHEL 9.2 kernel should not be much different from the mainline one (relative to the time it was released).
I would like to see the dmesg if it gives some hints.

@sukhi61166 is this a test system where you can eventually test different kernels, or is that not possible?

@maurizio-lombardi and @igaw
There is nothing out of the ordinary in the dmesg. Please see the output below.

We can try another kernel if needed. But the original testing was all done with the same kernel, using different versions of nvme-cli.

time sudo .build/nvme connect -t rdma -s 4420 -a 10.10.10.10 -i 7 -n nqn.org:nvme.1 -I f0a54208-c6fd-4f15-bee0-92a790b7d192 -q nqn.2014-08.org.nvmexpress:uuid:a9ff9600-3ac5-11eb-8000-3cecef1fc322 -vvv
warning: use hostid which does not match uuid in hostnqn
kernel supports: instance cntlid transport traddr trsvcid nqn queue_size nr_io_queues reconnect_delay ctrl_loss_tmo keep_alive_tmo hostnqn host_traddr host_iface hostid duplicate_connect disable_sqflow hdr_digest data_digest nr_write_queues nr_poll_queues tos fast_io_fail_tmo discovery dhchap_secret dhchap_ctrl_secret
option "keyring" ignored
connect ctrl, 'nqn=nqn.org:nvme.1,transport=rdma,traddr=10.10.10.10,trsvcid=4420,hostnqn=nqn.2014-08.org.nvmexpress:uuid:a9ff9600-3ac5-11eb-8000-3cecef1fc322,hostid=f0a54208-c6fd-4f15-bee0-92a790b7d192,nr_io_queues=7,ctrl_loss_tmo=600'
connect ctrl, response 'instance=0,cntlid=1'
nvme0: nqn.org:nvme.1 connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme0
connecting to device: nvme0

real    0m0.325s
user    0m0.004s
sys     0m0.038s

----------------dmesg -------------

[Wed Mar  6 15:52:01 2024] nvme nvme0: queue_size 128 > ctrl sqsize 8, clamping down
[Wed Mar  6 15:52:01 2024] nvme nvme0: creating 7 I/O queues.
[Wed Mar  6 15:52:01 2024] nvme nvme0: mapped 7/0/0 default/read/poll queues.
[Wed Mar  6 15:52:01 2024] nvme nvme0: new ctrl: NQN "nqn.org:nvme.1", addr 10.10.10.10:4420

There is a warning, but I don't think it's the reason for the long delay

warning: use hostid which does not match uuid in hostnqn

And if you do the next connect, does it take any longer?

I quickly looked at the 1.x sources and the connect code itself hasn't changed that much. The main difference is the 'topology scan'. So maybe the scan code is getting confused by the hierarchy and iterates over and over. Could you set up your system again and try to gather the sysfs tree by running the collect-sysfs.sh script from the libnvme sources?

https://github.com/linux-nvme/libnvme/blob/master/scripts/collect-sysfs.sh
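
A possible way to run it, assuming a plain checkout and root privileges (a sketch, adjust paths as needed):

git clone https://github.com/linux-nvme/libnvme.git
cd libnvme
sudo ./scripts/collect-sysfs.sh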

@sukhi61166

We can try another kernel if needed. But the original testing was all done with the same kernel, using different versions of nvme-cli.

Yes, I understand, but when submitting a kernel bug report to the mailing list, the developers usually want to verify whether the bug is reproducible with the very latest kernel version, excluding any custom change that a particular distro (in this case RHEL) may have introduced in their own kernel.

There is a warning, but I don't think it's the reason for the long delay
warning: use hostid which does not match uuid in hostnqn

This is due to a change introduced in nvme-cli (2.6, I think: "1:1 mapping between hostnqn and hostid") that isn't compatible with the RHEL 9 kernel. It shouldn't be the root cause of the slowdown.
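
For reference, the warning itself can be avoided by deriving the hostid from the same UUID that is embedded in the hostnqn, which is what the reproduction script further down in this thread does. A sketch reusing the variables from the original loop (uuidgen stands in for whatever UUID generator is available):

uuid=$(uuidgen)    # or: uuid=$(uuid)
sudo nvme connect -t rdma -s 4420 -a $tgt1_ip -i 7 -n nqn.org:nvme.1 -I $uuid -q nqn.2014-08.org.nvmexpress:uuid:$uuid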

@igaw
Please see the attached files as you requested. I connected to 2 drives 2 times each. I saved the dmesg and the verbose output of the nvme connect command in separate files.
You are right, it is the scanning that is adding up the delay. The first few connections are faster. As more and more devices are added, the scanning takes longer for the subsequent connects. Both dmesg.txt and host.txt show that.

Darn, the script didn't really work for the RH kernel. Most of the interesting stuff is missing. But I think we know now what the problem is and I should be able to reproduce this.

Yes, I noticed that the information did not seem very useful. I am glad you have enough information to debug it. Please keep me posted about the progress.
Let me know if you need anything else from my end.
Thanks

Darn, the script didn't really work for the RH kernel. Most of the interesting stuff is missing. But I think we know now what the problem is and I should be able to reproduce this.

@igaw, if you need any specific sysfs info please list it here. We can pull it manually for you.

@igaw
Please keep me posted with the progress and let me know if you need anything from my end.
Thanks

I'm trying to reproduce it with the commands from the initial report

#!/bin/bash

ip=192.168.154.145

for i in {1..100}; do
        uuid=$(uuid)
        time nvme connect -t tcp -s 4420 -a $ip -i 7 -n nqn.io-0 -I $uuid -q nqn.2014-08.org.nvmexpress:uuid:$uuid
        time nvme connect -t tcp -s 4420 -a $ip -i 7 -n nqn.io-1 -I $uuid -q nqn.2014-08.org.nvmexpress:uuid:$uuid
done
connecting to device: nvme2

real    0m0.063s
user    0m0.000s
sys     0m0.021s
connecting to device: nvme3

real    0m0.070s
user    0m0.000s
sys     0m0.026s
connecting to device: nvme4

real    0m0.071s
user    0m0.000s
sys     0m0.031s

[...]

connecting to device: nvme201

real    0m0.571s
user    0m0.008s
sys     0m0.515s

As expected, after each connect command the next one takes a bit longer. But after 200 connects it is still under one second, and this is in a VM with a debug kernel with KASAN enabled. Not really sure how to reproduce this.

I've counted the number of openat syscalls when over 200 nvme block devices have been created. It's roughly 8200, which gives around 40 openat calls per new block device. This seems reasonable, so I don't see excessive work here.

17:19:03 openat(AT_FDCWD, "/sys/class/nvme-subsystem", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme-subsystem/nvme-subsys2/model", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme-subsystem/nvme-subsys2/serial", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme-subsystem/nvme-subsys2/firmware_rev", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme-subsystem/nvme-subsys2/subsystype", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme-subsystem/nvme-subsys2/iopolicy", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/transport", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/address", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/firmware_rev", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/model", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/state", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/numa_node", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/queue_count", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/serial", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/sqsize", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/dhchap_secret", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/dhchap_ctrl_secret", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/tls_key", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/cntrltype", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/dctype", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/bus/pci/slots", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/nvme2c80n1/ana_state", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme80/nvme2c80n1/ana_grpid", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme81/hostnqn", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme81/hostid", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme81/dhchap_secret", O_RDONLY) = 3
17:19:03 openat(AT_FDCWD, "/sys/class/nvme/nvme81/subsysnqn", O_RDONLY) = 3

Could you collect the strace output of a connect when it starts to get longer:

strace -t -o /tmp/trace.out nvme connect[...]

Maybe we see something there.

@igaw
Please see the attached completion of all 64 admin and 448 I/O connections. I used the -vvv option so you can see the output. Since yours is a virtual environment and ours are actual NVMe SSDs connecting over the fabric, the timings are significantly different.
Which OS and what Kernel version are you trying with?
dmesg.txt
nvme-conn.txt
trace.txt

I am recapturing the strace file by timing each connection and removing the verbose option -vvv. I will update those files as soon as all connections are done.
GitHub doesn't allow a .out file to be uploaded, so I have renamed it to .txt.

Below are the files with each connection timed and without verbose output, to reduce the extra load:
dmesg.txt
nvme_conns.txt
strace.txt

Can you check if this helps?

diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 09fcaa519e5b..d3d8e6806432 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -147,8 +147,6 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
         * we have no UUID set
         */
        if (uuid_is_null(&ids->uuid)) {
-               dev_warn_once(dev,
-                       "No UUID available providing old NGUID\n");
                return sysfs_emit(buf, "%pU\n", ids->nguid);
        }
        return sysfs_emit(buf, "%pU\n", &ids->uuid);

That's a kernel patch, mind :-)

Which OS and what Kernel version are you trying with?

I'm using Tumbleweed as the OS but roll my own kernel and nvme-cli. My test setup was using nvme-tcp with a remote VM. As this simple test shows, nvme-cli is not really slowing down on its own.

The 'No UUID' message indicates that something is very busy trying to discover the newly added disks. It's very likely systemd-udevd. IIRC, we had some bug reports for SLES where systemd-udevd went a bit overboard with the discovery and slowed the system down considerably. This has been fixed in upstream systemd.

# udevadm control --log-level=debug
# journalctl -f -u systemd-udevd.service

should show what systemd-udevd is doing. You might first just check if the systemd-udevd process has a very high load.
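
A quick way to check that load (a sketch using standard procps options):

ps -C systemd-udevd -o pid,pcpu,pmem,etime,cmd    # CPU/memory usage of udevd and its workers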

But let's also see the strace so we can be sure we are looking at the right thing.

Hmm. The 'No UUID' message really just means that the target is not providing a UUID; this is perfectly legit and in fact quite a few vendors do it. So that in itself isn't an indicator that things are busy.

The 'No UUID' message indicates that something is very busy trying to discover the newly added disks. It's very likely systemd-udevd. IIRC, we had some bug reports for SLES where systemd-udevd went a bit overboard with the discovery and slowed the system down considerably. This has been fixed in upstream systemd.

As a side note, and to add more fuel to the fire, UDisks2 now also reacts to uevents from nvme devices, immediately sending some id-ctrl/id-ns + get_log_page for health information, etc.

Red Hat Enterprise Linux 9.2 (Plow)
5.14.0-284.30.1.el9_2.x86_64

The above, however, doesn't apply to RHEL 9.x, which ships an old UDisks. Still, it might be wise to temporarily stop and mask udisks2.service, as it still probes generic block devices and looks for filesystems.

should show what systemd-udevd is doing. You might first just check if the systemd-udevd process has a very high load.

...or check udevadm monitor for any excessive traffic.
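
For example (a sketch; the subsystem filters are just a suggestion to cut down noise):

udevadm monitor --kernel --udev --property --subsystem-match=nvme --subsystem-match=block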

I am a little confused. If there are any action items for me, please tag my name and give me instructions on what I need to provide. From the discussion above and the few commands exchanged, I am not sure whether those are just discussion or something for me to try.
Thanks Suki

@sukhi61166

# udevadm control --log-level=debug
# journalctl -f -u systemd-udevd.service

should show what systemd-udevd is doing. You might first just check if the systemd-udevd process has a very high load.
But let's also see the strace so we can be sure we are looking at the right thing.

...or check udevadm monitor for any excessive traffic.

@sukhi61166

systemctl disable udisks2.service --now
systemctl mask udisks2.service

(temporarily)

See also Chapter 12. Managing systemd

I am a bit confused by this thread. The issue we are seeing occurs with nvme-cli versions 2.2.1 and later. It does not occur with nvme-cli version 1.16 on the same host (RHEL 9.2 running kernel 5.14.0-284.30.1.el9_2.x86_64).

Do you think the behavior of other programs that may or may not be running on the host is impacted by running nvme-cli 2.2.1 or later but not nvme-cli 1.16?

Do you think the behavior of other programs that may or may not be running on the host is impacted by running nvme-cli 2.2.1 or later but not nvme-cli 1.16?

Unlikely, but it's still better to rule out some variables from the equation when debugging. A distro is an ecosystem, and e.g. there are more things stacked on top of libnvme than just nvme-cli. As for nvme-cli itself, for example, the udev rules may have changed between the versions. There are quite a number of services listening to uevents and doing their own stuff.

@sukhi61166

# udevadm control --log-level=debug
# journalctl -f -u systemd-udevd.service

should show what systemd-udevd is doing. You might first just check if the systemd-udevd process has a very high load.
But let's also see the strace so we can be sure we are looking at the right thing.

...or check udevadm monitor for any excessive traffic.

@igaw
Please see the attached files in the zipped folder with the udev log, as per your instructions above.
Strace_Udev.zip

@sukhi61166

systemctl disable udisks2.service --now
systemctl mask udisks2.service

(temporarily)

See also Chapter 12. Managing systemd

@tbzatek
I captured the same files with the udisks2 service disabled, but I don't see the connections being made any faster. It still took 50m55.571s. I changed the udevadm log level to info but then I didn't see any output, so I changed it back to debug for a few minutes and then back to info again. I wanted to see if the extra debugging adds to the delay, but it's still the same with both experiments.
dmesg.txt
nvme-conn.txt
strace-udisk2.txt
udev_log.txt

@sukhi61166

# udevadm control --log-level=debug
# journalctl -f -u systemd-udevd.service

should show what systemd-udevd is doing. You might first just check if the systemd-udevd process has a very high load.
But let's also see the strace so we can be sure we are looking at the right thing.

...or check udevadm monitor for any excessive traffic.

@igaw Please see the attached files in the zipped folder with the udev log, as per your instructions above. Strace_Udev.zip

Sorry, I just feel like opening the files straight from this link might be easier :)
dmesg.txt
nvme-conn.txt
strace-udev.txt
udev-log.txt

Unfortunately, the strace is tracing the shell script, not the command itself.

12:49:08 execve("./nvme_conn_roce_64_448.sh", ["./nvme_conn_roce_64_448.sh"], 0x7ffd1bc85ff8 /* 15 vars */) = 0

Just add the strace to the last invocation of the nvme command.

The udev-log.txt contains the same content as strace-udev.txt.

Unfortunately, the strace is tracing the shell script, not the command itself.

12:49:08 execve("./nvme_conn_roce_64_448.sh", ["./nvme_conn_roce_64_448.sh"], 0x7ffd1bc85ff8 /* 15 vars */) = 0

Just add the strace to the last invocation of the nvme command.

The udev-log.txt contains the same content as strace-udev.txt.

@igaw
I added the strace to each connect command. Please see attached.

3_25_strace_udev.zip

The strace logs just trace sudo and not nvme-cli:

strace -t -o strace_$i.txt sudo .build/nvme connect -t rdma -s 4420 -a $tgt2_ip -i 7 -n nqn.org:nvme.5 -I $host_id$i -q $host_nqn$i # -vv

So I don't see what nvme-cli is doing; the trace contains the syscalls from sudo.

You need to run:

$ su -
# strace -t -o /tmp/trace.out  .build/nvme connect -t rdma -s 4420 -a $tgt2_ip -i 7 -n nqn.org:nvme.5 -I $host_id$i -q $host_nqn$i 
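
Alternatively, running strace itself under sudo should trace the nvme binary directly; a sketch of an equivalent invocation:

sudo strace -t -o /tmp/trace.out .build/nvme connect -t rdma -s 4420 -a $tgt2_ip -i 7 -n nqn.org:nvme.5 -I $host_id$i -q $host_nqn$i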

@igaw
I have created a separate strace file for each admin connection (64). We also added a spreadsheet which shows that the openat calls grow evenly after the initial five connections. I hope this set of logs has everything you need.

Uploading 3_26_Strace_udev.zip…

The link you posted points to this issue, not to the file. If it is too big, I think we just need the last 'nvme connect' strace for reviewing.

I couldn't figure out how to get a list of files uploaded to an issue. Sorry.

3_26_Strace_udev.zip

@igaw
Please unzip the attached folder. I am not sure why the attachment did not go through yesterday.

@igaw
I am just following up to make sure the last attachment I sent has everything you needed. Please let me know if there is any progress on the issue. This is holding us back from using some of the features which are available in the latest versions but not in 1.16.

The attachment is fine. I've looked quickly into it and saw that libnvme loops a lot over some parts and rediscovers the same info many times. No time yet to dive into the details. At least we know now where to look.

@igaw
Thanks for letting me know. That's all I needed to confirm. Please let me know if anything else is needed from my end.
Thanks, Suki

$ grep nguid strace8_18.txt | awk '{ print $3; }'
[...]
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n7/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n7/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n7/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n7/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n8/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n8/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n8/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n8/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n8/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n8/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n8/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n9/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n9/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n9/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n9/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n9/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n9/nguid",
"/sys/class/nvme-subsystem/nvme-subsys7/nvme7n9/nguid",
[...]

$ grep nguid strace8_18.txt | awk '{ print $3; }' | sort | wc -l
2016

$ grep nguid strace8_18.txt | awk '{ print $3; }' | sort | uniq | wc -l
256

So the scan algorithm iterates several times over the same sysfs entries. For the last connect, only 256 'nguid' reads would be necessary, but we do over 2000.
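
The same kind of check can be extended beyond 'nguid' to rank which sysfs paths the last connect re-reads most often (a sketch; same trace file and awk field as above):

$ grep openat strace8_18.txt | awk '{ print $3; }' | sort | uniq -c | sort -rn | head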

One simple trick, without touching the algorithm, is to make sure we only scan the same entry once by first checking whether we have already found it (we probably should do this anyway).

The second step is to figure out if we can change the algorithm without breaking too much.

Could you give linux-nvme/libnvme#812 a try and see if it helps? I've tried to fix it according to the logs, because my local test setup is not showing this problem.

@sukhi61166 did you have a chance to test my patch? After a bit of pondering, we might even need to go one step further and keep a global list of already discovered controllers.

@igaw
Yes, I tested it and I don't see any significant change. I tried it without disabling the udisks2 service, as that did not help before when I tried it. We built the executable using the changes you made in libnvme. Below is the version it came up with. If possible, can you please give me the full package pre-built next time to try? Please see the attached zipped folder with the same data.

real 52m27.815s
nvme version 2.8 (git 2.8-109-gf9f2d58+)
libnvme version 1.8 (git 1.8)

4_12_Debug_cli.zip

Thanks for the test and log data. After writing the code, I was always sure we would need a global data structure to avoid all the rescans. And now we know it's necessary.

@igaw
I hope all is going well. I just wanted to follow up and see when we should expect a fix, or whether I can test another debug build once you implement the changes you mentioned above.

@igaw , @tbzatek
Just following up to see whether you have any update on the issue or a debug build for me to test. Please let me know.

No further ideas from my side.

Sorry, pretty busy with a lot of stuff. The current algorithm is clearly doing something suboptimal (I suspect there is at least an n^2 path). A simple stopgap solution I see is to use a global cache. The obvious problem with this approach is that we need to be careful to also monitor changes. For nvme-cli this isn't a real issue, but IIRC nvme-stas uses libnvme in a long running way which makes using caches a bit more difficult.

IIRC nvme-stas uses libnvme in a long running way which makes using caches a bit more difficult.

Thanks for considering the impact on nvme-stas 😉. Note, however, that we use pyudev to retrieve nvme-related status info from the sysfs. Since nvme-stas is a Python application, it was easier to use pyudev directly. And as a bonus pyudev comes with event-handling support which allows nvme-stas to register for nvme-related kernel events.

nvme-stas uses libnvme mainly to issue commands. Therefore, we would probably be OK if libnvme had a global cache. Other users/apps may not agree though...

This is definitely a tough problem to solve. 🤯

We almost need a new set of APIs that would allow us to tell libnvme (nvme-cli) that we intend to create a large number of connections. Either we pass the whole list of connections (linked-list or file) to libnvme (nvme-cli) so that it can perform all the connections in a loop without having to rescan the whole sysfs between each connection, or we need to be able to create a "connection group" and tell libnvme (nvme-cli) when the group can be created. Something like this:

group = nvmeConnectionGroup();
for (int i = 0; i < num_connections; i++) {
    nvmeConnectionGroupAdd(group, ....); /* Add connection to group, but do not connect yet */
}
nvmeConnectionGroupApply(group); /* Make all the connections in the group */

@igaw
Thanks for the response. I understand that there are risks involved with each change. If it is possible please implement the change and give me the debug build. I can test it for this issue as well as extensive testing to see if anything else gets broken. Please keep me posted of the progress.
We appreciate your time and efforts.
Thanks & Regards
Suki

IIRC nvme-stas uses libnvme in a long running way which makes using caches a bit more difficult.

Thanks for considering the impact on nvme-stas 😉. Note, however, that we use pyudev to retrieve nvme-related status info from the sysfs. Since nvme-stas is a Python application, it was easier to use pyudev directly. And as a bonus pyudev comes with event-handling support which allows nvme-stas to register for nvme-related kernel events.

nvme-stas uses libnvme mainly to issue commands. Therefore, we would probably be OK if libnvme had a global cache. Other users/apps may not agree though...

This is definitely a tough problem to solve. 🤯

We almost need a new set of APIs that would allow us to tell libnvme (nvme-cli) that we intend to create a large number of connections. Either we pass the whole list of connections (linked-list or file) to libnvme (nvme-cli) so that it can perform all the connections in a loop without having to rescan the whole sysfs between each connection, or we need to be able to create a "connection group" and tell libnvme (nvme-cli) when the group can be created. Something like this:

group = nvmeConnectionGroup();
for (int i = 0; i < num_connections; i++) {
    nvmeConnectionGroupAdd(group, ....); /* Add connection to group, but do not connect yet */
}
nvmeConnectionGroupApply(group); /* Make all the connections in the group */

Given the challenge of creating a large number of connections (1000s), this group approach seems like the right way. The assumption here is that the large number of connect responses and AENs resulting from the associated admin commands (various identify commands, get/set-features, ns-attach, etc.) would also be handled as a group, before letting udev or other services get involved, as those services may add to the fray.

There is no need for a new C API. We already have everything for doing multiple connects in one go, e.g. see connect-all.
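
For example, assuming the target exposes a discovery subsystem on the same portal, a single call along these lines would replace one iteration of the original loop (a sketch; the options mirror the ones already used with nvme connect):

sudo nvme connect-all -t rdma -a $tgt1_ip -s 4420 -i 7 -I $host_id -q $host_nqn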

Given the challenge of creating a large number of connections (1000s), ...

@ksingh-ospo - If you're planning to connect to 1000s of storage subsystems, you should definitely consider using a Discovery Controller (DC) and follow what @igaw said and use nvme connect-all.

Alternatively, you also have the option of using a Centralized Discovery Controller (CDC) as per "TP8010 NVMe-oF Centralized Discovery Controller". The CDC was designed specifically for managing large numbers of Storage subsystems and Host computers. Everything is configured/managed in a central place, the CDC, where you can define Zones. A zone is a group of Hosts and the Storage subsystems they are allowed to connect to. On each host you can either invoke nvme connect-all or use nvme-stas, which is designed specifically to interact with CDCs and automate configuring connections on a host computer.

nvme-stas uses libnvme mainly to issue commands. Therefore, we would probably be OK if libnvme had a global cache. Other users/apps may not agree though...

One of the places where we use libnvme is udisksd, a heavily threaded long running process (months, years). Any global cache would need locking.

By definition, such a cache is a stateful object in an otherwise mostly stateless library. I would recommend keeping the current level of separation and moving the cache lifecycle burden to the callers, which are often stateful processes and are able to listen to uevents. Not so easy from an API standpoint, though.

Or, we need to be able to create a "connection group" and tell libnvme (nvme-cli) when the group can be created.

Not a bad idea at all! A connection group object and a cache object are quite similar. It would be nice to have a way to report connection errors for each record separately.

Good points on the global cache. One thing which could work, though, is to create/populate a cache whose lifetime is limited to the scan call. Or we go one step further and create a sysfs iterator which first collects all the data without doing the topology scan, e.g. reading all entries in /sys/class/nvme, and only then start creating the tree. So basically splitting the operations of iterating and creating the tree.

On the connection group: let's try not to mix too many things here. The issue here is that the tree creation algorithm is excessively looping. This should be fixed eventually, but I'd like to have more tests in place before starting with this work. So the stopgap solution I see is to stop issuing so many reads to sysfs.

Thanks everyone for the input. May I know when we should expect the fix to be implemented? I can test the debug build first.

If you're planning to connect to 1000s of storage subsystems, you should definitely consider using a Discovery Controller (DC) and follow what @igaw said and use nvme connect-all.

Currently, we don't have a CDC and don't have any plans in the near future.

We can use connect-all; however, it has a limitation: it won't allow us to create a unique host-nqn or host-id per connection. Unique values are a helpful technique that allows us to identify the connections in an Ethernet trace, since we're using a single host to make all the 64 admin and 448 I/O connections.

for i in {1..8}
do
sudo nvme connect -t rdma -s 4420 -a $tgt1_ip -i 7 -n nqn.org:nvme.1 -I $host_id$i -q $host_nqn$i
sudo nvme connect -t rdma -s 4420 -a $tgt1_ip -i 7 -n nqn.org:nvme.2 -I $host_id$i -q $host_nqn$i
.....
done

Good points on the global cache. One thing which could work, though, is to create/populate a cache whose lifetime is limited to the scan call. Or we go one step further and create a sysfs iterator which first collects all the data without doing the topology scan, e.g. reading all entries in /sys/class/nvme, and only then start creating the tree. So basically splitting the operations of iterating and creating the tree.

On the connection group: let's try not to mix too many things here. The issue here is that the tree creation algorithm is excessively looping. This should be fixed eventually, but I'd like to have more tests in place before starting with this work. So the stopgap solution I see is to stop issuing so many reads to sysfs.

@igaw: Do we know what changed between nvme-cli v1.16 and v2.2/v2.8 that affected the behavior we've reported?

Is it possible to use an older libnvme that worked better with a newer nvme-cli (say v2.8 or newer)? Probably not, due to API incompatibility but asking in case that could be a workaround for Suki's case.

@igaw and @tbzatek
I tried using connect-all with the released 2.8 nvme-cli version and it still takes 21-22 minutes, which is still a lot. I have the straces; let me know if you want me to attach them.
real 21m37.056s

I am sorry, just too busy with other work.

@ksingh-ospo yes, see the comments above. The topology scan algorithm is very likely causing this regression due to its n^2 or even worse approach.

I've created a flamegraph to confirm what we had already figured out. The reading of the attributes is the main contributor:

[flamegraph image attached]

What we could try, though, is to defer the logic in nvme_scan().
Currently, we're calling 'nvme_scan()', and then all further operations just work on the in-memory tree.
Which is all fine in general if we want to do operations on the entire tree. But for 'connect' we really don't want to do that, we're just interested in finding out if a given entry is present.
So, we could start with an empty tree, and build it incrementally with nvme_lookup_host() and nvme_lookup_ctrl().
Lemme see ...

Check linux-nvme/libnvme#845 for an implementation.

@ksingh-ospo @sukhi61166 could you try whether the skip-namespace trick helps in your setup? It should avoid the scanning of the namespaces. When scanning the namespaces we trigger a namespace identify command, and this is very likely causing the large delay.

BTW, I've tested the idea of caching the data objects (ctrls, subsystems and namespaces), but it turns out there is no double scan happening. Everything happens linearly, so my statement about n^2 is a bit misleading. The file opens are all from scanning the namespaces for each path. That is, we have to scan n namespaces for p paths, thus n * p, which does not scale linearly.

@ksingh-ospo @sukhi61166 could you try whether the skip-namespace trick helps in your setup? It should avoid the scanning of the namespaces. When scanning the namespaces we trigger a namespace identify command, and this is very likely causing the large delay.

Will try and update as soon as I can.

Did you have time to test it? I don't mind merging it even if it doesn't help your case. It should make things better anyway, as it avoids unnecessary work.

@igaw
I didn't get a chance to test it yet. Is it possible to get the executable binary? Our expert who can build it is out for a few days. If you can give me a debug build, I can test right away. If you want to merge the changes anyway, let me know and I will pick up that build and test it.

Sorry about the delay.

Alright, then I'll merge it (as it avoids work anyway). This will trigger an AppImage build for testing. You can get it here:

https://monom.org/linux-nvme/upload/AppImage/nvme-cli-latest-x86_64.AppImage

just wait until it has been updated.
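
Roughly, assuming wget is available and the AppImage runs like a regular nvme binary:

wget https://monom.org/linux-nvme/upload/AppImage/nvme-cli-latest-x86_64.AppImage
chmod +x nvme-cli-latest-x86_64.AppImage
sudo ./nvme-cli-latest-x86_64.AppImage version
sudo ./nvme-cli-latest-x86_64.AppImage connect -t rdma -s 4420 -a $tgt1_ip -i 7 -n nqn.org:nvme.1 -I $host_id$i -q $host_nqn$i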

Alright, then I'll merge it (as it avoids work anyway). This will trigger an AppImage build for testing. You can get it here:

https://monom.org/linux-nvme/upload/AppImage/nvme-cli-latest-x86_64.AppImage

just wait until it has been updated.

I just grabbed the latest master and am going to test it. Thanks a lot!