There is no way to define a service without a dockerfile or image, right? How much work would a fork / PR for this be roughly?


SteffRhes opened this issue a year ago

(duplicate of docker/compose#10758, as I've been told this repo is the better place for this kind of question)

What is the problem you're trying to solve

Just some overall questions:

It's not possible to define a service within a compose.yml that does not use a dockerfile or image, right?
How much work would you estimate it to be, if I chose to fork compose and implement such a feature?
Could this be of use upstream for compose? Or is it scope creep?
Would there be other tools that serve our purpose better?

Describe the solution you'd like

I work in academia / linguistics, and we use docker and compose heavily. Besides the main reason for docker (easy encapsulation), we also use it more and more for reliable reproduction of data processing pipelines for scientific quality assurance. I.e. we define services handling data within a compose.yml which is versioned with git. This way most data processing can be persisted and made reproducible by just launching a service with a compose.yml at a certain git commit, while the mounted data repos and source code are equally committed (and pointed to with git submodules).

E.g.:

compose.yml 1, defining such a pipeline, at some commit, also pointing to submodules:

[data repo 1, at some commit] 
|
mounted as volume, and used as input for code
|
V
[code repo 1, with micro compose.yml and dockerfile, at some commit] 
|
mounted as volume, and produced as output by code
|
V
[data repo 2, at some commit]

This setup works great. However, it is not as modular as we would like, because we want to use any component of this chain in arbitrary other chains as well. E.g. the same input data as before but with another application, code repo 2, defined as a whole in compose.yml 2:

compose.yml 2:

[data repo 1]
|
V
[code repo 2]

Code repos can already be individually defined with a self-contained compose.yml that can be reused in a containing compose.yml with extends, but data repos cannot, as they don't have a dockerfile or image. This hinders our aim of increased modularity, where we want to arbitrarily chain data and processing together and persist the whole pipeline for reproducibility.

Such a "data service" (i.e. a compose service without a dockerfile or image) can't easily be done without dirty hacks in vanilla compose, right?
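
For illustration, the existing reuse of a code repo via extends, as described above, might look like the following sketch. The file path, service name, and mount paths are hypothetical and only show the pattern. A data repo has no counterpart to this, since a service needs either a build context or an image.

```yaml
# compose.yml 1 (sketch; all names and paths below are assumptions)
services:
  code-repo-1:
    # Reuse the self-contained service definition shipped with the code repo
    extends:
      file: ./code-repo-1/compose.yml   # submodule with its own compose.yml and dockerfile
      service: app                      # hypothetical service name in that file
    volumes:
      - ./data-repo-1:/input:ro         # data repo 1 (submodule) as input
      - ./data-repo-2:/output           # data repo 2 (submodule) as output
```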

Suggestion

For our purpose it would be great if we could define a static "data service" within a compose.yml that only serves the purpose of providing indexable metadata (mostly what kind of data it is and which tools it is compatible with), so that it can be reused in an arbitrary containing compose.yml with extends.

I am considering allocating my time to such a feature. But before doing so, I kindly ask:

How much work would it be to fork compose and implement such a feature?
Is such a feature desirable for compose, or is it scope creep?
Or are there other tools that serve our purpose better (I didn't find any)?

Cheers,
Stefan

Nicolas De loof · Answer 1 · Tue Jul 04 2023 21:52:53 GMT+0800 (China Standard Time)

If I understand correctly, the challenge you're trying to address is to define the source for data ("data repo") that eventually results in a volume used by your computation service ("code repo"), and you'd like this source to be expressed in compose.yaml for easier reusability, but obviously there's no associated "service" (i.e. running process).

IMHO this should be addressed directly by a volume definition. The Docker API doesn't offer a standard way to pre-populate a volume, so I can see two approaches here:

Define an actual service that you use as an initializer to populate a compose volume from some source. Such a service could be a basic docker image that runs git clone <data repository> when the fresh volume is still empty on first run.
Create a dedicated docker engine volume plugin (you can use https://github.com/vieux/docker-volume-sshfs as an example) to do the same, passing the data repository URL in the volume's driver_opts. You then don't need a "data service".
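
A minimal sketch of the first approach, under stated assumptions: the alpine/git image is used as the initializer, and the repository URL, ref, service names, and mount paths are placeholders that do not come from this thread. The initializer clones into the named volume only when it is empty, and the computation service waits for it to finish.

```yaml
# compose.yaml (sketch; URL, ref, names, and paths are assumptions)
services:
  data-init:
    image: alpine/git
    # Override the image's git entrypoint so a small shell script can run
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        # Populate the volume only on first run, when it is still empty
        if [ -z "$$(ls -A /data)" ]; then
          git clone "$$DATA_REPO_URL" /data
          git -C /data checkout "$$DATA_REPO_REF"
        fi
    environment:
      DATA_REPO_URL: https://example.org/data-repo-1.git   # placeholder URL
      DATA_REPO_REF: main                                   # placeholder; pin a commit for reproducibility
    volumes:
      - data-repo-1:/data

  code:
    build: ./code-repo-1
    depends_on:
      data-init:
        # Start only after the initializer has exited with status 0
        condition: service_completed_successfully
    volumes:
      - data-repo-1:/input:ro

volumes:
  data-repo-1:
```

The doubled `$$` keeps compose from interpolating the variables itself, so the shell inside the container expands them. The second approach would replace data-init with a driver and driver_opts entry on the volume, whose option names depend entirely on the plugin that is written.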

SteffRhes · Answer 2 · Wed Jul 05 2023 21:33:01 GMT+0800 (China Standard Time)

> If I understand correctly, the challenge you're trying to address is to define the source for data ("data repo") that eventually results in a volume used by your computation service ("code repo"), and you'd like this source to be expressed in compose.yaml for easier reusability, but obviously there's no associated "service" (i.e. running process).

Exactly.

Thank you for your suggestions @ndeloof, we will look into something like this.

For others, a discussion took place here too: docker/compose#10758