linux-nvme / nvme-cli

NVMe management command line interface.

Home Page: https://nvmexpress.org

Network dropped connection on reset

tkenbo opened this issue

We have several test scripts that test NVMe spec compliance, and some of them intermittently fail with the error in the title; the drive then becomes inaccessible until we power cycle it.

2024-05-08 09:34:11.849326 INFO > cmd: nvme reset /dev/nvme1
2024-05-08 09:35:16.406425 ERROR Unable to extract status code from output:
Reset: Network dropped connection on reset

2024-05-08 09:35:16.406788 CRITICAL type: NvmeCli Error(CalledProcessError)
returncode: None
stderr: Reset: Network dropped connection on reset

This error tends to occur when we test multiple drives on one system concurrently, and our investigation showed it is the result of interference between the "nvme reset" and "nvme list" commands: when one of the drives on the system is being reset by one test process, a list command sent by another process can trigger the error.
I wrote a simple Python script in which one thread repeatedly sends a reset to a target drive while another thread sends list commands. I ran this script against several Gen4 drives from various vendors, and they all failed within 10 minutes to 1 hour. I also tried one Gen3 drive, but that one did not fail.

When this happens, the system log always shows the following messages.

May  8 09:35:15 SM-221H-527 kernel: nvme nvme1: I/O 16 QID 0 timeout, disable controller
May  8 09:35:15 SM-221H-527 kernel: nvme nvme1: Removing after probe failure status: -4

I was able to reproduce this error on various systems with kernels 4.18 (Rocky 8.8), 5.13 (Rocky 9.2), and 5.14 (Rocky 9.3).

It would be much appreciated if anyone could enlighten me on this.
I can provide the script I created on request.

Does this happen with the latest nvme-cli and kernel? With older nvme-cli and kernel versions, nvme list issues requests to the devices.
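
One way to check that for yourself is to trace the ioctl traffic that nvme list generates (a minimal sketch, assuming strace is installed and this is run as root): little or no per-device ioctl activity suggests a sysfs-based scan, while older builds send Identify commands to each controller.

import subprocess

# Run 'nvme list' under strace and print its ioctl traffic. A burst of
# ioctls per /dev/nvme* device suggests the older, command-issuing
# behavior; near-silence suggests a sysfs-only topology scan.
trace = subprocess.run(
    ['strace', '-f', '-e', 'trace=ioctl', 'nvme', 'list'],
    stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, encoding='utf-8')

for line in trace.stderr.splitlines():
    if 'ioctl(' in line:
        print(line)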

Hi @tkenbo,

I have seen this issue; we added a workaround to power cycle the device once it becomes inaccessible and then restart the process from where it left off.
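
Roughly, that recovery wrapper looks like the sketch below (power_cycle() is a hypothetical hook standing in for whatever PDU/BMC/slot-power mechanism a given rig provides):

import os
import subprocess

def power_cycle(dev):
    # Hypothetical hook: drive the rig's PDU/BMC/slot-power mechanism.
    raise NotImplementedError

def reset_with_recovery(dev):
    # Issue 'nvme reset'; if it fails or the device node is gone, power
    # cycle the drive and tell the caller to restart the test from the
    # last checkpoint.
    result = subprocess.run(['nvme', 'reset', dev])
    if result.returncode != 0 or not os.path.exists(dev):
        power_cycle(dev)
        return False
    return True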

But I think this is a bug. Is it possible to share your host properties (OS, kernel version, nvme-cli version, etc.) along with the Python script you mentioned?

I will try to reproduce it and investigate this issue further.

Thanks!

Hi @AkashKumarSingh11032001,

Thank you for your response.
We tested this on Rocky Linux 8.8 (kernel 4.18), 9.2 (kernel 5.13), and 9.3 (kernel 5.14). A Gen3 Samsung SSD did not show this issue, but every Gen4 SSD I tested did.
I also tested nvme-cli 2.2.1 and 2.9.1 (the latest), and both showed the issue.
I am struggling to install the latest kernel on my Rocky systems, so I would much appreciate you looking into this.
In our testing, when it fails, it fails within a few hours at the longest, so running for several hours should be long enough.

For whatever reason, I could not attach the script directly, so I put it inline as a snippet.

import subprocess
import sys
import threading
import time

# Shared flag: set by either thread on the first failure so both stop.
fail = False

def list_loop():
    global fail
    print('list starts')
    try:
        # Hammer 'nvme list' while the other thread resets the drive.
        while not fail:
            subprocess.run('nvme list', shell=True, check=True,
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                           encoding='utf-8')
            time.sleep(0.1)
    except subprocess.CalledProcessError as e:
        print('list thread error')
        print(e)
        fail = True
    print('list ends')

def reset_loop(dev):
    global fail
    print('reset starts')
    try:
        # Reset the target drive back to back, as fast as it completes.
        while not fail:
            subprocess.run('nvme reset ' + dev, shell=True, check=True,
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                           encoding='utf-8')
    except subprocess.CalledProcessError as e:
        print('reset thread error')
        print(e)
        fail = True
    print('reset ends')

if __name__ == '__main__':
    if len(sys.argv) == 1:
        print('ERROR: need device node "python xxx.py /dev/nvme1"')
        sys.exit(1)

    print('starts on ' + sys.argv[1])
    # One thread lists, one resets; join both after either flags a failure.
    list_thread = threading.Thread(target=list_loop)
    reset_thread = threading.Thread(target=reset_loop, args=(sys.argv[1],))

    list_thread.start()
    reset_thread.start()

    list_thread.join()
    reset_thread.join()

Hi @igaw,

I tried the following:

  1. nvme-cli 2.9.1 on the older kernel 4.18: my script still errored out within 1 hour.
  2. kernel 6.9.1 with the older nvme-cli 2.2.1: this ran for 15 hours before I stopped it.

With the new kernel, I see "no usable path" messages in the system log, but the drive seems okay.

May 21 08:08:42 sm-110p-140 kernel: nvme nvme5: resetting controller
May 21 08:08:42 sm-110p-140 kernel: block nvme5n1: no usable path - requeuing I/O
May 21 08:08:42 sm-110p-140 kernel: block nvme5n1: no usable path - requeuing I/O
May 21 08:08:42 sm-110p-140 kernel: nvme nvme5: D3 entry latency set to 10 seconds
May 21 08:08:42 sm-110p-140 kernel: nvme nvme5: 52/0/0 default/read/poll queues

If this is the result of some fix in the kernel driver, do you have any idea which kernel version has the change?

It looks like the driver recovers from the reset. The requeue message means that outstanding I/Os are not dropped but requeued and issued as soon as the device recovers. The last line indicates all is good again.

The fix I am referring to consists of the kernel exposing additional sysfs entries and libnvme not issuing any commands when it finds all the necessary sysfs entries for a topology scan; see the sketch below.
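
To illustrate the idea (a rough sketch of a sysfs-only scan, not libnvme's actual code; model, nsid and size are long-standing sysfs attributes, and the directory layout differs on native-multipath setups):

import glob
import os

def read_attr(path):
    # Return a sysfs attribute's contents, or None if it is absent.
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

# Enumerate controllers and namespaces purely from sysfs. No admin
# commands are sent to the devices, so this cannot race with a reset.
for ctrl in sorted(glob.glob('/sys/class/nvme/nvme*')):
    name = os.path.basename(ctrl)
    print(f"{name}: {read_attr(os.path.join(ctrl, 'model'))}")
    for ns in sorted(glob.glob(os.path.join(ctrl, name + 'n[0-9]*'))):
        nsid = read_attr(os.path.join(ns, 'nsid'))
        size = read_attr(os.path.join(ns, 'size'))  # 512-byte sectors
        print(f"  {os.path.basename(ns)}: nsid={nsid} size={size}")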

You need a kernel which ships a1a825ab6a60 ("nvme: add csi, ms and nuse to sysfs") (kernel v6.8).
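
To check whether a running kernel ships that change, you can probe for the new namespace attributes directly. This sketch assumes the attribute file names csi, metadata_bytes (for "ms") and nuse from the commit subject; verify them against your kernel's Documentation/ABI.

import glob
import os

# Probe every namespace for the attributes added by a1a825ab6a60.
# If they are present, libnvme can scan the topology without sending
# Identify commands to the devices.
for ns in sorted(glob.glob('/sys/class/nvme/nvme*/nvme*n[0-9]*')):
    have = all(os.path.exists(os.path.join(ns, attr))
               for attr in ('csi', 'metadata_bytes', 'nuse'))
    print(f"{os.path.basename(ns)}: {'present' if have else 'missing'}")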

That said, it could be another kernel fix that helps in your setup. From what you are describing, though, it sounds like the hang is caused by nvme list issuing a command.