microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.

Home Page: https://microsoft.github.io/kernel-memory

SimpleFileStorage - slash '/' chars are not allowed in filename (Azure Blob Name)

rosieks opened this issue

Context / Scenario

I want to import a document using SimpleFileStorage.

What happened?

I switched document storage from Azure Blob Storage to SimpleFileStorage and importing suddenly stopped working, because the file name contains a slash, which SimpleFileStorage does not allow.

Importance

a fix would make my life easier

Platform, Language, Versions

.NET, C# - KernelMemory 0.24.231228.5

Relevant log output

System.ArgumentException: The file name default/01cf76edae894daf848edc53c08c7c74202401020228414515055/Some Folder/Some Folder/Contoso's-Use-Case.md contains some invalid chars: slash '/' chars are not allowed
   at Microsoft.KernelMemory.FileSystem.DevTools.DiskFileSystem.ValidateFileName(String fileName)
   at Microsoft.KernelMemory.FileSystem.DevTools.DiskFileSystem.WriteFileAsync(String volume, String relPath, String fileName, Stream streamContent, CancellationToken cancellationToken)
   at Microsoft.KernelMemory.ContentStorage.DevTools.SimpleFileStorage.WriteFileAsync(String index, String documentId, String fileName, Stream streamContent, CancellationToken cancellationToken)
   at Microsoft.KernelMemory.Pipeline.BaseOrchestrator.UploadFormFilesAsync(DataPipeline pipeline, CancellationToken cancellationToken)
   at Microsoft.KernelMemory.Pipeline.BaseOrchestrator.UploadFilesAsync(DataPipeline currentPipeline, CancellationToken cancellationToken)
   at Microsoft.KernelMemory.Pipeline.InProcessPipelineOrchestrator.RunPipelineAsync(DataPipeline pipeline, CancellationToken cancellationToken)
   at Microsoft.KernelMemory.Pipeline.BaseOrchestrator.ImportDocumentAsync(String index, DocumentUploadRequest uploadRequest, CancellationToken cancellationToken)

@rosieks that's by design: in most systems the forward slash is used to separate folders, so it cannot be part of a file name.

I understand that it's a limitation of the file system, but I think it should be handled somewhere in between. Right now I'm confused about the purpose of this parameter.

@rosieks could you provide a snippet of code to reproduce the exception?

Here is the code that I use:

var blobStorage = new Azure.Storage.Blobs.BlobContainerClient(request.ConnectionString, request.ContainerName);
foreach (var blob in blobStorage.GetBlobs())
{
    var blobClient = blobStorage.GetBlobClient(blob.Name);
    using var blobContent = blobClient.OpenRead();
    await _memory.ImportDocumentAsync(blobContent, blob.Name);
}

you're using this API:

/// <summary>
/// Import any stream from memory, e.g. text or binary data, with details such as tags and user ID.
/// </summary>
/// <param name="content">Content stream to import</param>
/// <param name="fileName">File name to assign to the stream, used to detect the file type</param>
/// <param name="documentId">Document ID</param>
/// <param name="tags">Optional tags to apply to the memories generated by the document</param>
/// <param name="index">Optional index name</param>
/// <param name="steps">Ingestion pipeline steps, optional override to the system default</param>
/// <param name="cancellationToken">Async task cancellation token</param>
/// <returns>Document ID</returns>
public Task<string> ImportDocumentAsync(
    Stream content,
    string? fileName = null,
    string? documentId = null,
    TagCollection? tags = null,
    string? index = null,
    IEnumerable<string>? steps = null,
    CancellationToken cancellationToken = default);

passing blobContent as the content parameter, and blob.Name as fileName.

blob.Name can contain virtual directory names (see Azure blobs docs) separated by the slash char like file paths in a local hard drive. Try passing only the last part of blob.Name, after blob's virtual directories, or replacing the virtual directory separator with an underscore.
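The two workarounds suggested above can be sketched as small helpers. This is only an illustration: `BlobNames`, `LastSegment` and `Flatten` are hypothetical names, not part of Kernel Memory or the Azure SDK. Both variants keep the file extension, so file type detection from the name should still work.

```csharp
using System;

// Hypothetical helpers (not part of Kernel Memory) that turn an Azure blob
// name into a value without '/' characters.
public static class BlobNames
{
    // Returns the part after the last '/', i.e. the name without its
    // virtual directories. Blobs in different virtual directories can end
    // up with the same result.
    public static string LastSegment(string blobName)
    {
        int i = blobName.LastIndexOf('/');
        return i < 0 ? blobName : blobName.Substring(i + 1);
    }

    // Keeps the virtual directory names but replaces each '/' with '_'.
    public static string Flatten(string blobName) => blobName.Replace('/', '_');
}
```

In the loop from the snippet above, the call would then become `await _memory.ImportDocumentAsync(blobContent, BlobNames.Flatten(blob.Name));`. `Flatten` is the safer choice when several virtual directories contain files with the same name.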

I think that if fileName is only used to detect the file type, then the content storage should not assume that it can use it as a proper file name. Even if I pass only a file name without a path, the file name itself may contain characters that are not supported by the given storage (e.g. non-UTF-8 characters).
So the question is: what's the proper way to keep a reference to the file? E.g. I ask a question and I want the answer to include references to the files that were used to answer it. Should I use tags for that?

When providing a response, the API includes details about the sources used (refer to citations). These citations should offer enough information to connect back to the data stored in Azure blobs. While the solution doesn't support storing custom metadata, experimenting with tags is an option. However, it's worth noting that tags aren't specifically designed for this purpose, and the outcomes may not be perfect.
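One way to connect a citation back to the original blob, without relying on tags, is to derive the document ID from the blob name and keep a lookup table. The sketch below is an assumption-heavy illustration: `SourceMap` is a hypothetical class, and it assumes each citation exposes the document ID that was passed as the `documentId` parameter of `ImportDocumentAsync`.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper (not part of Kernel Memory): maps a stable document ID
// back to the original blob name.
public sealed class SourceMap
{
    private readonly Dictionary<string, string> _byDocumentId = new();

    // Derives a deterministic ID from the blob name, using only lowercase
    // hex digits and '-', and remembers which blob it belongs to.
    public string Register(string blobName)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(blobName));
        string documentId = "blob-" + Convert.ToHexString(hash, 0, 16).ToLowerInvariant();
        _byDocumentId[documentId] = blobName;
        return documentId;
    }

    // Returns the original blob name, or null when the ID is unknown.
    public string? Resolve(string documentId) =>
        _byDocumentId.TryGetValue(documentId, out var name) ? name : null;
}
```

During import, `documentId: map.Register(blob.Name)` would be passed alongside a slash-free file name. When an answer comes back, the document ID found in a citation can be given to `Resolve` to recover the blob path. The map here lives in memory only; a real application would need to persist it.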

OK, so those citations reference the content stored in the content store, not the original location that I provide (e.g. my website, git repo, wiki)?

Yes, citations point to data in the content store, which can currently be Azure blobs or the local disk. The system can be extended to support other storage types, similarly to how it has been extended to support Redis, SQL Server, Elasticsearch and others for vector storage.