microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.

Home Page: https://microsoft.github.io/kernel-memory

[QUESTION] About security filters scenarios with many groups

luismanez opened this issue

Looking for advice here. My scenario is basically applying security trimming to the imported documents (a pretty common one in companies using SharePoint, I guess).

The approach is to index the permissions of each document using a custom tag (storing the security object IDs, as recommended here: https://learn.microsoft.com/en-us/azure/search/search-security-trimming-for-azure-search). Yes, with this approach you need some sync job to refresh permissions (incremental / full crawls, webhooks, whatever works for you).

Once the permissions are indexed using custom tags, we can use the MemoryFilters parameter to apply the property filters: https://microsoft.github.io/kernel-memory/security/filters

However, say we have a user who is a member of a great many groups (e.g. 500). The MemoryFilter list built with code like this:

var answer = await memory.AskAsync(question,
                                   filters: new List<MemoryFilter>
                                   {
                                      MemoryFilters.ByTag("user", "Taylor"),
                                      // ... OR ...
                                      MemoryFilters.ByTag("user", "Andrea"),
                                   });

will have a bunch of "ByTag" clauses, and will be translated to an Azure Search $filter like:

(tags/any(s: s eq 'Authorized:xxxxxxx')) or (tags/any(s: s eq 'Authorized:xxxxxx')) or (tags/any(s: s eq 'Authorized:xxxxx')) or …

And with many groups, you will get this error:

Invalid expression: Recursion depth exceeded allowed limit.\r\nParameter name: $filter

For these scenarios, a search.in filter might work better:

(tags/any(s: search.in(s, 'Authorized:xxxxxx,Authorized:xxxxx,Authorized:xxxxx,…')))

That would require changes in the BuildSearchFilters method.

Anything in the backlog for these scenarios?
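To make the suggestion concrete, here is a minimal sketch of folding many values of one tag into a single search.in clause. It assumes tags are stored as "name:value" strings in a collection field called tags, as the filters above suggest. BuildTagFilter is a hypothetical helper, not Kernel Memory's actual BuildSearchFilters code, and the '|' delimiter is an arbitrary choice that only works if no tag value contains that character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SearchInFilterSketch
{
    // Hypothetical helper (not part of Kernel Memory): builds one OData clause
    // that matches any of the given values for a single tag name.
    public static string BuildTagFilter(string tagName, IEnumerable<string> values, string delimiter = "|")
    {
        var items = values.Select(v => $"{tagName}:{v}").ToList();

        // A value containing the delimiter would be split into two values by search.in.
        if (items.Any(i => i.Contains(delimiter)))
        {
            throw new ArgumentException("Tag values must not contain the delimiter.", nameof(values));
        }

        // OData string literals escape a single quote by doubling it.
        var joined = string.Join(delimiter, items).Replace("'", "''");

        return $"tags/any(s: search.in(s, '{joined}', '{delimiter}'))";
    }
}
```

For example, BuildTagFilter("Authorized", new[] { "g1", "g2" }) produces tags/any(s: search.in(s, 'Authorized:g1|Authorized:g2', '|')). The filter stays one clause deep regardless of how many groups the user belongs to.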

Interesting problem, do you know how SharePoint and Active Directory scale access control to similar scenarios? It might be a reverse lookup, e.g. after fetching a list of records, filter out those that are not accessible, client side. In KM that would mean fetching all relevant records, regardless of user access, and filtering them out on the client side, before the user can consume them.
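As a rough illustration of the client-side alternative described above, the sketch below drops retrieved records the caller may not see. RetrievedRecord and its AuthorizedPrincipals property are hypothetical stand-ins for whatever the search returns (they are not Kernel Memory types), and the principal IDs are assumed to be compared case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a retrieved record: its ID plus the principals allowed to read it.
public sealed record RetrievedRecord(string DocumentId, IReadOnlyCollection<string> AuthorizedPrincipals);

public static class ClientSideTrimmingSketch
{
    // Keeps only records that share at least one principal with the current user.
    public static List<RetrievedRecord> Trim(
        IEnumerable<RetrievedRecord> records,
        IEnumerable<string> userPrincipals)
    {
        // A hash set keeps each membership check O(1), even with hundreds of groups.
        var principals = new HashSet<string>(userPrincipals, StringComparer.OrdinalIgnoreCase);

        return records
            .Where(r => r.AuthorizedPrincipals.Any(p => principals.Contains(p)))
            .ToList();
    }
}
```

The trade-off is that the search has to over-fetch: records the user cannot see still consume the result limit before they are discarded.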

As far as I know, SharePoint indexes permissions too, so the search engine applies the security trimming on the server side. I can confirm this because, for instance, let's say Bob has access to Document1. Bob runs a search query and Document1 is returned. Right then, an admin changes the permissions on Document1 and removes Bob's access. Bob will still see Document1 in search results for a while, until the incremental crawl re-indexes Document1's permissions.

Actually, our approach is working fine, but we had to download the KM source code and edit the BuildSearchFilters method to compose a search.in query when we find multiple filters with the same key:
tags/any(s: search.in(s, '.....

We're happy to do a PR but wondering if there's something better.

Very curious to know how M365 Copilot solves this same problem 😄 ? I don't think it gets back all the documents, and then starts calling MS Graph to check if the current user has permissions on each document.

Happy to take the PR if you can work on it.

Trying to think about how one would store "this document is accessible to user1, 2, 3.... 100000", there might be multiple approaches. E.g. one could be about using virtual groups stored in meta-tables, auto-clustering users to reduce the cardinality of those filters. Something like:

doc1 is accessible to u1,u7,u8,u100,u102,u103
doc2 is accessible to u1,u7,u8,u100,u102,u888

vgroup1=u1,u7,u8, u100,u102
vgroup2=u103
vgroup3=u888

and so on...
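One way to read the virtual-group idea is: users who can access exactly the same set of documents are interchangeable, so they can share one group. The sketch below implements that reading only; it is an assumption about what "auto-clustering" could mean, not an existing KM feature. VirtualGroupSketch and Cluster are hypothetical names, and document IDs are assumed not to contain newline characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VirtualGroupSketch
{
    // Input: document ID -> users allowed to read it.
    // Output: clusters of users that have identical access across all documents.
    public static List<List<string>> Cluster(IDictionary<string, string[]> documentAcl)
    {
        // Invert the ACL: user -> sorted set of documents the user can read.
        var docsByUser = new Dictionary<string, SortedSet<string>>();
        foreach (var (doc, users) in documentAcl)
        {
            foreach (var user in users)
            {
                if (!docsByUser.TryGetValue(user, out var docs))
                {
                    docs = new SortedSet<string>(StringComparer.Ordinal);
                    docsByUser[user] = docs;
                }
                docs.Add(doc);
            }
        }

        // Users with the same document set get the same signature, hence the same group.
        return docsByUser
            .GroupBy(kv => string.Join("\n", kv.Value))
            .Select(g => g.Select(kv => kv.Key).ToList())
            .ToList();
    }
}
```

Fed the doc1/doc2 lists above, this yields the same three clusters: one with u1, u7, u8, u100 and u102, one with u103, and one with u888. Each document would then be tagged with its virtual groups instead of individual users.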

Thanks @dluc
I'll do the PR in the next few days.

If I'm understanding correctly, in our scenario the meta-table is Azure AD, and the virtual groups are AAD Groups / M365 Groups. So the document is indexed with a custom tag "PrincipalsAuthorized", where we store the different groups that have access to the document (and also user IDs, if only specific users are configured). Each document is therefore indexed with fewer than 20 IDs in most cases.

However, the problem comes when you want to query only the documents the current user has permissions on. In this case, if a user is a member of 500 groups (pretty common in M365, as every Team is an M365 Group in Azure AD), the search query will have 500 "conditions":

(tags/any(s: s eq 'PrincipalsAuthorized:xxxxxxx')) or (tags/any(s: s eq 'PrincipalsAuthorized:xxxxxx')) or .......

This query will crash in Azure Search (Invalid expression: Recursion depth exceeded allowed limit.\r\nParameter name: $filter)

The search.in query works fine for these scenarios, but KM must also keep the ability to combine multiple MemoryFilters (this is what we're working on now, before sending the PR) ...
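A sketch of how the two requirements could coexist, under stated assumptions: each filter is modelled as a list of (key, value) pairs that are AND-ed together, and the filters themselves are OR-ed. Filters holding a single pair are folded into one search.in clause per key, while multi-pair filters keep their eq tests. CombinedFilterSketch is hypothetical and is not the code from the PR; tag values are assumed not to contain '|'.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CombinedFilterSketch
{
    // Formats "key:value" and escapes single quotes for an OData string literal.
    private static string Literal(string key, string value) => $"{key}:{value}".Replace("'", "''");

    // Outer list: OR. Inner list: AND between the (key, value) pairs of one filter.
    public static string Build(IEnumerable<IReadOnlyList<(string Key, string Value)>> filters)
    {
        var all = filters.Where(f => f.Count > 0).ToList();
        var clauses = new List<string>();

        // Single-pair filters sharing a key collapse into one search.in clause,
        // which is equivalent to OR-ing one eq test per value.
        foreach (var group in all.Where(f => f.Count == 1).GroupBy(f => f[0].Key))
        {
            var values = string.Join("|", group.Select(f => Literal(f[0].Key, f[0].Value)));
            clauses.Add($"tags/any(s: search.in(s, '{values}', '|'))");
        }

        // Multi-pair filters keep one eq test per pair, AND-ed inside parentheses.
        foreach (var filter in all.Where(f => f.Count > 1))
        {
            var parts = filter.Select(p => $"tags/any(s: s eq '{Literal(p.Key, p.Value)}')");
            clauses.Add("(" + string.Join(" and ", parts) + ")");
        }

        return string.Join(" or ", clauses);
    }
}
```

With 500 single-tag filters on PrincipalsAuthorized, the result is a single search.in clause rather than 500 OR-ed conditions, and any additional multi-tag filter is simply appended with or.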

Hey @dluc, I've sent a PR with our solution to this issue. Please give it a try: although it is working for us, we're only using a single tag and might be missing something.

Many thanks!

Hi @dluc
Sorry to bother you, but we are upgrading our (big) solution to .NET 8, and we'd love to have our PR merged so we can get rid of our (old) copy of the KM code. I know you are busy, but can you at least let me know whether the PR looks good and is likely to be merged in 2-3 weeks? Otherwise I will copy the latest KM source code and add my changes, but that's not cool 😄

Many thanks!

@luismanez assuming that the results are the same and the PR is improving the query (not adding new features), I plan on running a few tests and making sure I fully understand the new syntax. Then yes, the PR should be merged and released soon, I think 2 weeks max 👍

Closing this one, as it has been addressed in package 0.27.240207.1.
Thanks for your help @dluc!