dagger / dagger

An engine to run your pipelines in containers

Home Page: https://dagger.io

Namespace cache volumes by module

shykes opened this issue · comments

Problem

Cache volumes exist in the engine's global namespace: if modules A and B each create a cache volume with id foo, they will share read and write access to the same volume, as long as they are run on the same engine. This allows one module to corrupt the data of another module, either accidentally or maliciously.

Solution

Namespace the name of cache volumes, so that each module can only read and write to its own cache volumes.
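As a rough illustration of the idea, a namespaced key could be derived by combining the module's canonical address with the key the module asked for. The helper name `namespacedKey` and the example module addresses are hypothetical; this is a sketch of the concept, not the engine's implementation.

```go
package main

import "fmt"

// namespacedKey is a hypothetical helper: it scopes a user-supplied cache
// volume key to the canonical address of the module that requested it.
func namespacedKey(moduleAddr, key string) string {
	// %q quotes each part, so ("a", "b/c") and ("a/b", "c") cannot collide.
	return fmt.Sprintf("%q/%q", moduleAddr, key)
}

func main() {
	a := namespacedKey("github.com/example/a", "foo")
	b := namespacedKey("github.com/example/b", "foo")
	fmt.Println(a != b) // true: the same key in two modules maps to two volumes
}
```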

See also

This issue is an up-to-date reboot of #3345

What does module mean in this context? Each instance/call of a module gets separate cache volumes?

What if I have multiple calls to a Go module within the same project and I want to share the module/build cache between them?

I mean the full canonical address of the module, for example github.com/shykes/daggerverse/hello.

All instances of the same module would share the same volume, as long as they share the same persisted cache volume storage.

I agree we need to do something like this, but do want to note that the fact that cache volumes can be shared across modules is highly beneficial to performance for many common use cases. E.g. anything that uses Go benefits from sharing a cache volume for downloading deps (and possibly build cache, etc.).

Obviously in the choice between security-by-default and performance, security-by-default should win.

But in past discussions around all this the idea of cache volumes being tied to modules but still allowing modules to pass their own cache volumes around came up and is still worth considering imo. So say you are writing a module that calls to a bunch of other modules that do "go things"; you should be able to define a cache volume and pass those cache volumes to be used by modules you call.

  • This of course requires that the modules you are calling accept an optional cache volume to use as an override for their default (private) one, which probably just needs to become a best practice in this scenario.
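The opt-in sharing pattern described here might look roughly like the following. `CacheVolume` and `GoBuilder` are stand-in types invented for this sketch, not the Dagger SDK types, and the module addresses are made up.

```go
package main

import "fmt"

// CacheVolume is a stand-in type for this sketch, not the Dagger SDK type.
type CacheVolume struct {
	Owner string // address of the module the volume is namespaced under
	Key   string
}

// GoBuilder is a hypothetical module that does "Go things".
type GoBuilder struct{ Addr string }

// ModCache returns the caller's volume when one is passed, and otherwise
// falls back to the module's own private, namespaced volume.
func (g GoBuilder) ModCache(override *CacheVolume) CacheVolume {
	if override != nil {
		return *override
	}
	return CacheVolume{Owner: g.Addr, Key: "go-mod"}
}

func main() {
	shared := &CacheVolume{Owner: "github.com/example/ci", Key: "go-mod"}
	lint := GoBuilder{Addr: "github.com/example/lint"}
	test := GoBuilder{Addr: "github.com/example/test"}

	fmt.Println(lint.ModCache(nil) == test.ModCache(nil))       // false: private by default
	fmt.Println(lint.ModCache(shared) == test.ModCache(shared)) // true: shared on opt-in
}
```

The default stays isolated, and sharing only happens when the calling module explicitly hands its volume over.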

That seems like one reasonable way of maintaining security by default while still allowing opt-in performance benefits. I'm sure there are other approaches possible too.

Sure, since there's a cache volume type, it makes sense that you can pass it as argument. I've never seen that done so hadn't even thought about it, but it seems reasonable to me that we don't break it. As long as it doesn't break namespacing of modules (which I don't think it would), then I don't see any problem with that.

So these things should all be true:

  1. When a cache volume is loaded by key (cacheVolume(key: String!)), the key is always namespaced by module address. There is no way to bypass this.
  2. A cache volume ID can be used by any module. But a module cannot guess the ID of another module's cache volume: it needs to be passed explicitly.
  3. Ideally, cache volume IDs do NOT become sensitive values to be scrubbed from screenshots and logs (e.g. because they are not reusable across sessions).
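One way these three properties could fit together is to derive IDs with a per-session secret, so that an ID is stable within a session, cannot be computed by a module that only knows the key, and is meaningless in a later session. This is purely an assumption made for illustration: the `Session` type and the HMAC scheme below are not taken from the engine.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Session is hypothetical: it holds a random secret that exists only for
// the lifetime of one engine session.
type Session struct{ secret []byte }

// NewSession creates a session with a fresh random secret.
func NewSession() (*Session, error) {
	s := make([]byte, 32)
	if _, err := rand.Read(s); err != nil {
		return nil, err
	}
	return &Session{secret: s}, nil
}

// VolumeID derives an ID from the module address and key. Without the
// session secret, a module cannot compute another module's ID.
func (s *Session) VolumeID(moduleAddr, key string) string {
	mac := hmac.New(sha256.New, s.secret)
	fmt.Fprintf(mac, "%q/%q", moduleAddr, key)
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	s1, err := NewSession()
	if err != nil {
		panic(err)
	}
	s2, err := NewSession()
	if err != nil {
		panic(err)
	}
	const mod = "github.com/shykes/daggerverse/hello"
	fmt.Println(s1.VolumeID(mod, "foo") == s1.VolumeID(mod, "foo")) // stable within a session
	fmt.Println(s1.VolumeID(mod, "foo") == s2.VolumeID(mod, "foo")) // differs across sessions
}
```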

Yeah SGTM, coincidentally everything required to implement enforcement of only using cache volumes you create or are explicitly passed is also what's required to safely pass sockets around (#6747), which I'm working on right now, so should be feasible to implement all this in the very near future.

@sipsma putting this back on your radar since you're working on improved persistence of cache volumes. Seemed like a good time to keep this one in mind.

@shykes Yep I've been thinking about it. As of right now I've found it to be orthogonal to the efforts in #8004. The namespacing is something we'd implement on a higher level that's agnostic to the underlying storage.

The only connection point is that the storage of cache volumes is keyed by their definition (essentially just a hash of it), so namespacing is mainly a matter of incorporating another string into that definition.

@sipsma but wouldn't it be operationally convenient if the path of each volume on the host filesystem were human-readable and keyed by volume name instead of hash? Eg /var/lib/dagger/cachevolumes/node_modules.

With namespacing that might become: /var/lib/dagger/cachevolumes/github.com%2Fshykes%2Fdaggerverse%2Fnodejs/node_modules

wdyt?
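For reference, the escaped path shown above is what Go's `url.PathEscape` produces when each component is treated as a single path segment. The `volumePath` helper and the rejection of `.` and `..` are assumptions added for this sketch; only the root directory and the example names come from the comment.

```go
package main

import (
	"fmt"
	"net/url"
)

// root is the directory proposed in the comment above.
const root = "/var/lib/dagger/cachevolumes"

// volumePath is a hypothetical helper that builds a human-readable,
// namespaced path. Each component is escaped so that a "/" inside a module
// address cannot introduce extra directory levels.
func volumePath(moduleAddr, name string) (string, error) {
	for _, s := range []string{moduleAddr, name} {
		// PathEscape leaves dots alone, so reject segments that would
		// otherwise be interpreted as directory traversal.
		if s == "" || s == "." || s == ".." {
			return "", fmt.Errorf("invalid path segment %q", s)
		}
	}
	return root + "/" + url.PathEscape(moduleAddr) + "/" + url.PathEscape(name), nil
}

func main() {
	p, err := volumePath("github.com/shykes/daggerverse/nodejs", "node_modules")
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
	// /var/lib/dagger/cachevolumes/github.com%2Fshykes%2Fdaggerverse%2Fnodejs/node_modules
}
```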

@shykes whether to store cache volumes under human-readable names or a hash is already independent of namespacing by module.

A cache volume is currently a struct like:

type CacheVolume struct {
  Key string
  Uid int
  Gid int
  Source *Directory
}

Where right now the key in the storage is a hash of those fields (/var/lib/dagger/volumes/deadbeef123).

  • The uid/gid/source fields are all necessary since they are already part of our existing cache volume API and require separate cache volume instances.
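A minimal sketch of hash-keyed storage with a namespace folded into the definition could look like this. The `volumeDef` type, the `Namespace` field, the JSON encoding, and the use of SHA-256 are all assumptions chosen for illustration; `Source` is reduced to a digest string so the example stays self-contained.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// volumeDef mirrors the struct quoted above with two changes made for this
// sketch: Source is reduced to a digest string, and Namespace is the
// proposed extra field holding the module address.
type volumeDef struct {
	Namespace    string
	Key          string
	Uid          int
	Gid          int
	SourceDigest string
}

// storagePath hashes the whole definition, so any differing field,
// including the namespace, yields a different opaque directory.
func storagePath(d volumeDef) (string, error) {
	b, err := json.Marshal(d) // struct fields are encoded in declaration order
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return "/var/lib/dagger/volumes/" + hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := storagePath(volumeDef{Namespace: "github.com/example/a", Key: "foo"})
	b, _ := storagePath(volumeDef{Namespace: "github.com/example/b", Key: "foo"})
	fmt.Println(a != b) // true: same key, different module, different volume
}
```

Because the layout stays opaque, adding the namespace changes only the input to the hash, not the shape of the path.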

We could already change it to be human readable by just making the path instead be like /var/lib/dagger/volumes/<key>/<uid>-<gid>/<source dir hash>.

  • The reason I went with a hash for now is that you previously gave feedback that it's better for the structure to be opaque, which was fine by me since users can interact with the volumes via the API and that's better in the sense that we can change the layout more freely without being backwards incompatible.

Point being that if we do go with human-readable paths for cache volume storage, then namespacing is just a matter of appending another field to that struct and to the underlying path. The actual implementation of namespacing by module is still totally independent.

OK makes sense! Sorry I totally forgot I said that previously 😁 Keeping the format opaque to avoid a one-way door does make sense.