ory / keto

The most scalable and customizable permission server on the market. Fix your slow or broken permission system with Google's proven "Zanzibar" approach. Supports ACL, RBAC, and more. Written in Go, cloud native, headless, API-first. Available as a service on Ory Network and for self-hosters.

Home Page: https://www.ory.sh/?utm_source=github&utm_medium=banner&utm_campaign=keto


Poor query performance as traversal width increases

e-compton-bb opened this issue · comments

Preflight checklist

Ory Network Project

No response

Describe the bug

In our system, we have "projects" and "clips", both of which a user can view. If a clip is in a project that the user can view, then the user can view the clip transitively. This works fine when a clip is in a very small number of projects, but once a clip is in ~200 different projects, checking whether the user can view the clip slows down to >1s.

Running Keto locally, I can see it churning through SQL queries, loading every relationship into memory before checking whether the user has access through any of them.

Reproducing the bug

Permissions model:

class User implements Namespace {}

class UserGroup implements Namespace {}

class Project implements Namespace {
  related: {
    owners: User[];
    editors: (User | UserGroup)[];
  };
  permits = {
    view: (ctx: Context): boolean =>
      this.related.owners.includes(ctx.subject) || this.related.editors.includes(ctx.subject),
  };
}

class Clip implements Namespace {
  related: {
    owners: User[];
    parents: Project[];
  };
  permits = {
    view: (ctx: Context): boolean =>
      this.related.owners.includes(ctx.subject) || this.related.parents.traverse((parent) => parent.permits.view(ctx)),
  };
}

Create 200 relationships for 200 different projects all pointing to the same clipId, and make the user an owner or editor on any one of those projects.

Check whether the user can view the clip (a rough reproduction sketch follows the expected/actual results below).

Expected: the check to still be fairly fast (<50ms).
Actual: >150ms on an M2 Mac with SQLite, and >1s against a real database deployed in a container.
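
For reference, a rough sketch of these steps against Keto's REST APIs. This is not a drop-in script: the write API on :4467, the read API on :4466, the endpoint paths, and the payload shapes are assumptions based on recent Keto versions and may need adjusting.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// putTuple writes one relation tuple via the (assumed) admin write endpoint.
func putTuple(tuple map[string]any) {
	body, _ := json.Marshal(tuple)
	req, _ := http.NewRequest(http.MethodPut, "http://localhost:4467/admin/relation-tuples", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}

func main() {
	// Clip:clip-1#parents@Project:project-<i> for 200 different projects.
	for i := 0; i < 200; i++ {
		putTuple(map[string]any{
			"namespace": "Clip", "object": "clip-1", "relation": "parents",
			"subject_set": map[string]string{
				"namespace": "Project", "object": fmt.Sprintf("project-%d", i), "relation": "",
			},
		})
	}

	// Make the user an editor on just one of the projects.
	putTuple(map[string]any{
		"namespace": "Project", "object": "project-42", "relation": "editors",
		"subject_id": "alice",
	})

	// Time the check: can alice view clip-1?
	check, _ := json.Marshal(map[string]any{
		"namespace": "Clip", "object": "clip-1", "relation": "view", "subject_id": "alice",
	})
	start := time.Now()
	resp, err := http.Post("http://localhost:4466/relation-tuples/check", "application/json", bytes.NewReader(check))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status, "took:", time.Since(start))
}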

Relevant log output

No response

Relevant configuration

log:
  level: warn

namespaces:
  location: file:///home/nonroot/model.ts

serve:
  read:
    host: 0.0.0.0
    port: 4466
  write:
    host: 0.0.0.0
    port: 4467

Version

v0.12.0

On which operating system are you observing this issue?

macOS

In which environment are you deploying?

Docker

Additional Context

No response

After some more digging, the main problem here seems to be the breadth-first search in checkExpandSubject. It grabs every Project tuple from the DB and then starts looking through each one for the user. Each lookup is a separate query with its own round-trip time, which explains why it's so much slower against a real database.
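
To illustrate the shape of the problem, here is a simplified sketch of that access pattern (not Keto's actual code or schema; table and column names are placeholders): a clip in N projects costs roughly N+1 database round trips.

package sketch

import (
	"context"
	"database/sql"
)

// canViewClip mimics the pattern described above: one query to load every
// parent tuple of the clip, then one query per parent to look for the user.
func canViewClip(ctx context.Context, db *sql.DB, clipID, userID string) (bool, error) {
	// Round trip 1: fetch all ~200 parent projects of the clip.
	rows, err := db.QueryContext(ctx,
		`SELECT subject_set_object FROM tuples
		 WHERE namespace = 'Clip' AND object = $1 AND relation = 'parents'`, clipID)
	if err != nil {
		return false, err
	}
	defer rows.Close()

	var projects []string
	for rows.Next() {
		var p string
		if err := rows.Scan(&p); err != nil {
			return false, err
		}
		projects = append(projects, p)
	}

	// Round trips 2..N+1: check each project for the user, one query each.
	for _, p := range projects {
		var found bool
		err := db.QueryRowContext(ctx,
			`SELECT EXISTS (SELECT 1 FROM tuples
			  WHERE namespace = 'Project' AND object = $1
			    AND relation IN ('owners', 'editors') AND subject_id = $2)`, p, userID).Scan(&found)
		if err != nil {
			return false, err
		}
		if found {
			return true, nil
		}
	}
	return false, nil
}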

Hey Edward, I believe this is not yet optimized properly. If you have good ideas for how this could be optimized, we're happy to hear them!

We could switch from a breadth-first search to a depth-first search. In this case the relationship chain we're trying to find is Clip -> Project -> User, so by enumerating all the Projects first we always hit the worst case.

There's also a lot of chattiness between the container and the DB, so offloading more work to the DB could make a big difference. This seems like a really good use case for Postgres's WITH RECURSIVE clauses, although the current implementation appears to be deliberately independent of any particular SQL dialect:
https://www.dylanpaulus.com/posts/postgres-is-a-graph-database
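
For illustration, here is roughly what that could look like. This is a sketch only: it assumes a simplified tuples table (not Keto's real schema), uses Postgres syntax, and reduces the traversal to just the Clip -> parents -> Project -> owners/editors chain from the model above, rather than the general permits logic.

package sketch

import (
	"context"
	"database/sql"
)

// A single round trip: the database walks Clip -> parents -> Project itself
// via a recursive CTE and then checks owners/editors membership. Table and
// column names are placeholders, not Keto's real schema.
const canViewClipSQL = `
WITH RECURSIVE reachable(namespace, object) AS (
    -- start at the clip itself
    SELECT 'Clip'::text, $1::text
  UNION
    -- follow subject-set edges (e.g. the clip's parent projects)
    SELECT t.subject_set_namespace, t.subject_set_object
    FROM tuples t
    JOIN reachable r ON t.namespace = r.namespace AND t.object = r.object
    WHERE t.subject_set_object IS NOT NULL
)
SELECT EXISTS (
  SELECT 1
  FROM tuples t
  JOIN reachable r ON t.namespace = r.namespace AND t.object = r.object
  WHERE t.relation IN ('owners', 'editors') AND t.subject_id = $2
)`

func canViewClipRecursive(ctx context.Context, db *sql.DB, clipID, userID string) (bool, error) {
	var allowed bool
	err := db.QueryRowContext(ctx, canViewClipSQL, clipID, userID).Scan(&allowed)
	return allowed, err
}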