Arlodotexe / OwlCore.Storage

The most flexible file system abstraction, ever. Built in partnership with the UWP Community.


Recursive folder enumeration

Arlodotexe opened this issue · comments

commented

We still need to add recursive folder enumeration to the storage abstraction.

For a time, I was avoiding anything related to recursion. There were a lot of questions we couldn't answer.

The main blocker before was deciding how to do parallelism. If we were going to have this on the interface itself, it needed to be finalized before we added it. That's not a concern anymore, thanks to our extension method + fastpath interface approach.

We couldn't put time towards solving this while the rest was up in the air. We've figured out all the core bits as of our last big breaking update, so now we can add this!

We'll start with simple, sequential enumeration, like we did with GetItemRecursiveAsync. Parallelism options can be added as an overload later.

Implementors of fastpath methods can still use parallelism under the hood (e.g. anything that makes HTTP calls), but the consumer can't configure it for now. We'll do that in another update; there are too many questions around what the API would look like (breadth-first vs depth-first, etc.)
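For context, the extension method + fastpath interface approach could be sketched roughly like this. The interface and method names here are illustrative only, not the actual OwlCore.Storage API:

```csharp
// Hypothetical fastpath interface. An implementation that can enumerate
// recursively in an efficient, backend-specific way implements this.
public interface IFastGetItemsRecursive
{
    IAsyncEnumerable<IStorable> GetItemsRecursiveAsync();
}

public static class RecursiveEnumerationExtensions
{
    public static IAsyncEnumerable<IStorable> GetItemsRecursiveAsync(this IFolder folder)
    {
        // Fastpath: the implementation provides its own strategy
        // (possibly parallel under the hood).
        if (folder is IFastGetItemsRecursive fastPath)
            return fastPath.GetItemsRecursiveAsync();

        // Slowpath: a generic, sequential fallback built on GetItemsAsync.
        return SequentialFallbackAsync(folder);
    }

    private static async IAsyncEnumerable<IStorable> SequentialFallbackAsync(IFolder folder)
    {
        await foreach (var item in folder.GetItemsAsync())
        {
            yield return item;

            if (item is IFolder subFolder)
                await foreach (var subItem in SequentialFallbackAsync(subFolder))
                    yield return subItem;
        }
    }
}
```

Consumers only ever call the extension method; whether the fastpath or the fallback runs is an implementation detail.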

commented

The fastest way to enumerate the filesystem would be to use iteration, not recursion.

When dealing with SSD or cached HDD, the code will be the performance factor, not the IO.

For readability and ease of coding, recursion is a good option, and you can use parallelism to do this easily; something like:

void GetItemRecursiveAsync(folder)

  • process folder (add it to a list or something as a master return value)
  • enum all items in a folder
  • process files during the enum (add them to a list or something as a master return value)
  • store folders during the enum (local to this method)
  • parallel foreach over the stored folders, calling GetItemRecursiveAsync(folder)
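The steps above might look something like this in C#. This is a rough sketch, not the actual OwlCore.Storage API: Parallel.ForEachAsync requires .NET 6+, and a ConcurrentBag serves as the thread-safe "master return value". Note the result order is nondeterministic:

```csharp
// Rough sketch of the parallel recursive approach described above.
// Types and names are illustrative; assumes OwlCore.Storage's IFolder/IStorable.
public static async Task GetItemRecursiveAsync(IFolder folder, ConcurrentBag<IStorable> results)
{
    // Process the folder itself (add it to the master result set).
    results.Add(folder);

    // Folders found during enumeration are stored locally to this call.
    var subFolders = new List<IFolder>();

    await foreach (var item in folder.GetItemsAsync())
    {
        if (item is IFolder subFolder)
            subFolders.Add(subFolder);
        else
            results.Add(item); // Process files during the enumeration.
    }

    // Parallel foreach over the stored folders, recursing into each.
    await Parallel.ForEachAsync(subFolders, async (sub, _) => await GetItemRecursiveAsync(sub, results));
}
```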
commented

The fastest way to enumerate the filesystem would be to use iteration, not recursion.

I wouldn't worry about this too much. As long as it's tail-call recursive, the .NET compiler will optimize any recursive method calls into a loop for us, giving us the flexibility to do things like this:

public static async IAsyncEnumerable<IStorable> GetItemsRecursive(this IFolder folder)
{
  await foreach (var item in folder.GetItemsAsync())
  {
     yield return item;

     if (item is IFolder subFolder)
     {
        await foreach (var subItem in subFolder.GetItemsRecursive())
          yield return subItem;
     }
  }
}

Obviously not GetFilesAsync :)

commented

Obviously not GetFilesAsync :)

Typo, updated.

Also, I double checked and this method is not tail-call recursive. According to GPT4:

No, this code is not tail-call recursive. In tail-call recursion, the recursive call must be the last operation performed in the function. In this case, the recursive call to folder.GetItemsRecursive() is inside a loop, and there is a yield return statement after the recursive call. This prevents the compiler from optimizing the recursion as a tail call, and as a result, it will use additional stack space for each level of recursion.

The code you provided is an example of a generator method using C#'s async and yield keywords to asynchronously and lazily enumerate items from a folder hierarchy. It is a standard recursive implementation rather than tail-call recursive.

Can we refactor this to be tail-call recursive?

Converting this code to be tail-call recursive is not straightforward, as C# does not support tail-call optimization for async methods or methods with yield return statements. However, we can refactor the code to avoid recursion by using an explicit stack data structure. This way, we can prevent stack overflow exceptions for deep folder hierarchies.
Here's an example of how you can refactor the code using an explicit stack:

public static async IAsyncEnumerable<IStorable> GetItemsRecursive(this IFolder folder)
{
    var foldersStack = new Stack<IFolder>();
    foldersStack.Push(folder);

    while (foldersStack.Count > 0)
    {
        var currentFolder = foldersStack.Pop();

        await foreach (var item in currentFolder.GetItemsAsync())
        {
            yield return item;

            if (item is IFolder subFolder)
            {
                foldersStack.Push(subFolder);
            }
        }
    }
}

You need to use EnumerateFileSystemInfos, not GetItemsAsync. Otherwise the yield is less effective.

It's also very hard to use the stack if you want to move to multi-threaded enumeration, because you can't use "while stack.Count > 0" with several threads running the enumeration.

Another thought is the result set order. The way the code is now, and let's assume c:\, you will get a few files from the root, not all, then a subfolder, then some files from that, then a subfolder, etc. You're traversing the folders before the enumeration of the files is complete. That's okay as long as you explicitly state that the returned order of files and folders is totally random and the IStorable item may be from any part of the folder tree.
It might just be me, but I'd expect all root files, then the first subfolder with its files, then its subfolders, etc.
Hope that makes sense.
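The ordering described here (all of a folder's items before any descent) falls out naturally if the explicit stack in the earlier sketch is swapped for a queue, i.e. breadth-first traversal. Again a sketch for discussion, not the actual API:

```csharp
// Breadth-first variant: swapping the Stack for a Queue yields every item
// in a folder before descending into any of its subfolders.
// Assumes OwlCore.Storage's IFolder/IStorable.
public static async IAsyncEnumerable<IStorable> GetItemsBreadthFirst(this IFolder folder)
{
    var folderQueue = new Queue<IFolder>();
    folderQueue.Enqueue(folder);

    while (folderQueue.Count > 0)
    {
        var currentFolder = folderQueue.Dequeue();

        // Yield everything in this folder first...
        await foreach (var item in currentFolder.GetItemsAsync())
        {
            yield return item;

            // ...and only queue subfolders for later.
            if (item is IFolder subFolder)
                folderQueue.Enqueue(subFolder);
        }
    }
}
```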

commented

I think you're maybe focusing too hard on the System.IO implementation. My example (as stated) is the fallback slowpath used by the extension method, when the implementation does not implement the fastpath interface.

commented

you’re traversing the folders before the enumeration of the files is complete.

This was another hangup with adding anything recursive. We don't have a way to determine how to enumerate the items (Breadth first vs Depth first), we can only ask that all items be returned.

As we planned with parallelism, we can add overloads that enable customizing this at a later time. We'll start with simple, sequential enumeration, like we did with GetItemRecursiveAsync.

As for deciding breadth- vs depth-first search in the extension method's slowpath, we should do a bit more research first.

[breadth vs depth] I think this is going to depend on the use case.

If I want to display the items in a list to the user, think File Explorer, then I need breadth first, also the case of something like file copying (yes I think files). I can't think of a use case where I'd want depth first.

Also, keeping with files, you'd be enumerating the files (async, existence is known) and then using a SystemFile, which is synchronous and calls IO 4 times, just to go "oh yes, it does exist". Massive performance hit.

commented

Massive performance hit.

Performance of SystemFile and SystemFolder is discussed in #25.

If I want to display the items in a list to the user, think File Explorer, then I need breadth first, also the case of something like file copying (yes I think files). I can't think of a use case where I'd want depth first.

Had to brush up on the subject, so I did some back and forth between GPT-4, Bing and Bard, and came up with some insights (revised and fact-checked):

Comparison of DFS and BFS

  1. Space efficiency: DFS is generally more space-efficient for deep, narrow hierarchies, as it only needs to store the current path on the stack. BFS must hold the entire frontier (every item at the current depth) in its queue, which can grow large for wide hierarchies.

  2. Search efficiency: DFS is more efficient for finding deep files or directories in the filesystem, while BFS is more efficient for finding shallow files or directories. The relative efficiency depends on the specific use case and the location of the target files or directories.

  3. Optimality: BFS guarantees finding the shortest path (fewest levels) from the root to any node in an unweighted structure. DFS offers no such guarantee and may reach a node by a much longer route first.

  4. Handling of cycles: Depends on the filesystem structure. If the filesystem contains cycles, BFS handles them more effectively, as it visits each node only once and marks it as visited to prevent infinite loops. DFS can potentially get caught in infinite loops due to its recursive nature, although it is possible to overcome this issue by marking nodes as visited and using other techniques. If the filesystem doesn't contain cycles, both DFS and BFS will handle it equally well.

Ultimately, the choice between DFS and BFS for a specific filesystem depends on the characteristics of the filesystem and the particular use case. Factors such as the depth of the file hierarchy, the amount of available memory and the desired search speed should be considered when choosing between DFS and BFS.

What this means for us

When designing this API, we'll need to take into account:

  • Characteristics of the filesystem (network-based, local, etc.). The implementor has this information, and should be able to provide a parameterless default "fastest" approach.
  • Depth of the file hierarchy. The consumer has this information, so they should be able to select the approach that works best for them (BFS vs DFS, Parallelism).
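One possible shape that covers both points: a parameterless default where the implementor picks the fastest strategy for its backing store, plus an overload where the consumer chooses traversal order and parallelism. This is purely hypothetical, for discussion only; none of these names exist in OwlCore.Storage:

```csharp
// Hypothetical traversal option the consumer could select.
public enum TraversalOrder
{
    BreadthFirst,
    DepthFirst,
}

// Hypothetical fastpath interface sketching the two-overload approach.
public interface IRecursiveFolderEnumeration
{
    // Parameterless default: the implementor, who knows the backing store
    // (network, local, etc.), picks the fastest approach.
    IAsyncEnumerable<IStorable> GetItemsRecursiveAsync();

    // Overload: the consumer, who knows the hierarchy depth and use case,
    // picks the traversal order and degree of parallelism.
    IAsyncEnumerable<IStorable> GetItemsRecursiveAsync(TraversalOrder order, int maxDegreeOfParallelism = 1);
}
```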

Additional considerations

Doing more research, I remember another reason I didn't approach recursion until now - there are a lot of ways to do this: variations of DFS and BFS and other graph traversal algorithms. See https://www.baeldung.com/cs/dfs-vs-bfs-vs-dijkstra

This is why in the original AbstractStorage proposal, we created an IFolderScanner and implemented it with DepthFirstFolderScanner. These didn't make it into OwlCore.Storage, but Strix is still using them here and here.

Need to think this over more. Not sure if extension methods are the way to go here.