littlefs-project / littlefs

A little fail-safe filesystem designed for microcontrollers

Query: failing to find files after a soak test

schnoberts1 opened this issue

Hi,

I am soak testing and seeing a pattern of behaviour that looks like this (printing only the first argument of LFS_TRACE, i.e. the format string):

lfs_file_open(%p, %p, "%s", %x)
lfs_file_open -> %d
lfs_file_close(%p, %p)
lfs_file_close -> %d
lfs_remove(%p, "%s")
lfs_remove -> %d
lfs_file_open(%p, %p, "%s", %x)
lfs_file_open -> %d
lfs_file_size(%p, %p)
lfs_file_size -> %ld
lfs_file_write(%p, %p, %p, %lu)
lfs_file_write -> %ld
lfs_file_close(%p, %p)
lfs_file_close -> %d
lfs_file_open(%p, %p, "%s", %x)
lfs_file_open -> %d
lfs_file_close(%p, %p)
lfs_file_close -> %d

I am accessing one file.

In code, this looks more like:

while (true) {
    open the file; if it can be opened, close it and remove it.
    open the file, write a small amount of text (< 128 bytes), close the file.
    open the file for read, close the file.
    delay(100ms)
}
My partition is approximately 10 MB. The block size is 4096 bytes.
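For reference, here is a rough sketch of what one iteration of that loop looks like against the littlefs API. The file name, open flags, and payload are placeholders I've picked for illustration, not taken from the actual test code:

#include <string.h>
#include "lfs.h"

// Assumed to be mounted elsewhere against the mbed/USB block device.
extern lfs_t lfs;

static void soak_iteration(void) {
    lfs_file_t file;

    // Open the file; if it exists, close it and remove it.
    if (lfs_file_open(&lfs, &file, "soak.txt", LFS_O_RDONLY) == 0) {
        lfs_file_close(&lfs, &file);
        lfs_remove(&lfs, "soak.txt");
    }

    // Recreate the file and write a small (< 128 byte) payload.
    const char payload[] = "a small amount of text";
    if (lfs_file_open(&lfs, &file, "soak.txt",
                      LFS_O_WRONLY | LFS_O_CREAT) == 0) {
        lfs_file_size(&lfs, &file);
        lfs_file_write(&lfs, &file, payload, strlen(payload));
        lfs_file_close(&lfs, &file);
    }

    // Re-open for read, then close again.
    if (lfs_file_open(&lfs, &file, "soak.txt", LFS_O_RDONLY) == 0) {
        lfs_file_close(&lfs, &file);
    }

    // The real loop then delays ~100ms before repeating.
}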

What I find is that after a few hundred iterations it stops being able to open files. The spread of iterations to failure is large, ranging from as few as 5 at the smallest end to over 1000 at the larger end.

It normally fails in dbc_lfs_dir_fetchmatch.

I saw this issue in littlefs version 1 and have upgraded to the latest and greatest littlefs 2 at commit 225bb93e620ba5d0de288ce08099f10b5e26d602, but I still see it.

Is this something other people have seen?

I'm using this code to access a USB stick. I'm running mbed with https://github.com/arduino-libraries/Arduino_USBHostMbed5 to connect to the USB device and the mbed block device code to mediate the USB access.

I also have a block-device-based ring buffer solution mediated by the same software stack, and it runs in soak for days without an issue, so simple seek, read, and write operations through Arduino_USBHostMbed5 and the mbed block device work fine, as does the USB key.

I'm a bit at a loss at the moment, so any debugging inspiration would be appreciated. Is there a way to dump out the directory structure so I can look for corruption? I find the random nature of it puzzling.

It's worth noting that when this happens the process is slow; it takes a second or so for the open to return with an error, as if it's scanning the disk, garbage collecting, or moving things around. It doesn't matter if I disable wear-levelling. Increasing the cache size makes the problem worse.

I have tried multiple USB keys and replicated it on more than one device.

...on further investigation, it seems to always fail when trying to read block 0, offset 0 or block 1, offset 0. Once it's in this state it's stuck, so perhaps there is an issue with the mbed block reading.

The brand of USB key I have seems to time out after blocks 0 and 1 have been bashed at for a while. From that point it fails to respond to USB transfer requests. The block ring buffer works fine because it doesn't bash away at a couple of blocks.

Interesting, I wonder if this is some sort of early wear protection.

One thing you could try is setting block_cycles to a small value. block_cycles tells littlefs how many times it can erase a block before moving the data to a new block. Setting block_cycles=5, for example, will make littlefs move the metadata away from blocks 0x{0,1} after erasing those blocks 5 times.
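For concreteness, here is a sketch of where block_cycles lives in the configuration. The block device hooks and the other sizes are placeholders loosely based on the numbers in this thread (4096-byte blocks, ~10MB partition), not your actual config:

#include "lfs.h"

// Hypothetical block device callbacks provided by the mbed/USB glue layer.
extern int bd_read(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, void *buffer, lfs_size_t size);
extern int bd_prog(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, const void *buffer, lfs_size_t size);
extern int bd_erase(const struct lfs_config *c, lfs_block_t block);
extern int bd_sync(const struct lfs_config *c);

const struct lfs_config cfg = {
    .read  = bd_read,
    .prog  = bd_prog,
    .erase = bd_erase,
    .sync  = bd_sync,

    .read_size      = 16,
    .prog_size      = 16,
    .block_size     = 4096,
    .block_count    = 2560,    // ~10MB / 4096-byte blocks
    .cache_size     = 16,
    .lookahead_size = 16,

    // Move metadata off a block after it has been erased this many times,
    // so blocks 0x{0,1} stop being hammered.
    .block_cycles   = 5,
};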

I'm a bit at a loss at the moment, so any debugging inspiration would be appreciated. Is there a way to dump out the directory structure so I can look for corruption?

This may not be needed anymore, but in case it's useful: there are a couple of hacky Python scripts for parsing out littlefs's metadata. readtree.py, for example, prints the filesystem tree.

@geky, thanks for both suggestions. In the end I went and bought a different USB key and the problem went away. I think the fact that the failure wasn't predictable, i.e. it could fail in 5 rounds or maybe 1000, suggests a possible firmware bug on the USB key. Certainly it got itself into a state where a USB data transfer failed and all subsequent data transfer requests were never responded to. I know this because I instrumented the USB driver to see what was happening.

It's worth noting that when this happens the process is slow; it takes a second or so for the open to return with an error, as if it's scanning the disk, garbage collecting, or moving things around.

This is a bit of weird behavior to attribute to too much wear on blocks 0x{0,1}. I wonder if something is going on with the FTL inside the USB stick. These will usually do their own internal wear-leveling/garbage collection, which is why disabling wear-leveling is suggested.

But if Arduino_USBHostMbed5 is doing something clever in how it maps prog/erase to USB commands, it may be confusing the device?

@schnoberts1, glad to hear you found a solution :)

That does sound like something going wrong on the USB device side. Maybe Arduino_USBHostMbed5 is using more advanced commands without checking feature flags? Maybe the USB stick implemented commands/flags incorrectly? It's hard to know; USB can get quite complicated...

Arduino_USBHostMbed5's code looks pretty straightforward and didn't seem to be doing anything particularly odd; that said, I've not done much more than skim the code and instrument it. What's also interesting is that we have a number of these USB sticks (all the same model) and they all exhibit exactly the same behaviour. I also have a working "ring buffer on a partition" that uses prog/erase and works absolutely fine on the key. It's clearly replicable, though.