grammarly / rocker

Rocker breaks the limits of Dockerfile.

Pointers to other tools? Alternatives to MOUNT?

hartym opened this issue · comments

So sad to hear about the project discontinuation, but of course we all struggle with time and I understand.

Would it be possible to add a few pointers to other tools that can be used in the "much more mature [container ecosystem]" that we have today? The homepage now says "most of the critical and outstanding features of rocker can be easily covered by docker build or other well-supported tools", but I have a hard time finding tools I could consider for the future.

The future is probably buildkit-based, but it doesn't look "mature" to me yet. Also, if the frontend is dockerfile-based, we still have the same issues as before (no volume mounts, no artifact generation into dedicated images ...).

As far as I understand, docker build and the Dockerfile language evolved to allow "multi-stage" builds, but IMHO this is insufficient to cover all the use cases that Rocker covers.

I tried to explore most of the tools found in awesome-docker's builder section (https://github.com/veggiemonk/awesome-docker#builder), but I can't find a tool as good, simple and flexible as rocker.

Do you have some pointers for us?

Thanks!

Hey, @hartym. Thanks for bringing it up. I agree with you – not all of rocker's features can be covered, at least from what I could find quickly. We may need to change the disclaimer to make it clearer.

Let's go through those features in more detail:

  1. Multi-FROM. Docker multi-stage builds cover this and indeed do it in a slightly more idiomatic way. Rocker's EXPORT/IMPORT directives have glitches (they are dirty hacks by nature), while COPY --from is a much more elegant solution. I wanted to do something similar while designing multi-stage in rocker, but there was a lack of appropriate APIs to do so at the time.
  2. MOUNT – the most used and the hackiest feature. While solving the problem of package manager cache invalidation, it breaks the overall idea of immutability of docker builds and brings some nasty side effects. It's clear why the official docker toolset will never have something like this, but the problem remains. Docker's new --cache-from doesn't seem to help here either, though I'm not sure. I had a theory that the problem can be solved by adding a local package proxy – HTTP-level, or an even more specific one like npm-proxy-cache. This may speed up loading the dependency tree by replacing the internet with the loopback interface, which is almost a disk read. I spent a day experimenting with no luck, but I still believe it could solve the problem if more time were invested in it.
  3. Templating. We use it extensively at Grammarly and restrain ourselves from adding new functions to the templater that cover our specific use cases. The problem with templating is similar (in spirit) to other Rocker features: it gives you a lot of power by breaking the rules. In our practice, some Rockerfiles become so complex and ugly that they sometimes require a major refactoring. For example, a complex templated Rockerfile can be refactored into two separate files, and though this introduces code duplication, the files themselves become much clearer. Believe it or not, many things can be done using Docker's ARG and ENV. After all, no one forbids you from running your own pre-processor before running docker build – or is there a problem with that approach? I don't remember.
  4. TAG/PUSH. Yes, this is a nice little feature, and a lot of people love it. Still, it's not a major thing, as most build procedures are scripts and you can always add another line of bash. The only downside is that most of the time you want the Dockerfile and the repository URL to come together. I sometimes see people committing build.sh scripts along with Dockerfiles – though kludgy, maybe this is the way to go.
  5. ATTACH. Can be done by running docker run on an intermediate layer; you can see the layer ID in the docker build output. It's not quite the same as in Rocker, since ATTACH also loaded the corresponding MOUNTs, but that's not relevant if you use docker build.
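The docker run trick from point 5 can be wrapped in a few lines of shell. This is a hypothetical sketch, assuming the classic builder's " ---> <id>" output format (which may change between Docker versions):

```shell
# Sketch of debugging an intermediate layer with plain `docker build`
# (assumption: the classic builder prints each finished layer as " ---> <id>").
last_layer_id() {
  # Read `docker build` output on stdin and print the last layer ID seen.
  awk '/^ ---> [0-9a-f]+$/ { id = $2 } END { print id }'
}

# Usage (requires Docker):
#   id=$(docker build . | tee /dev/stderr | last_layer_id)
#   docker run --rm -it "$id" /bin/bash
```

If the build fails, the last printed layer ID is the one just before the failing step, which is usually the one you want a shell in.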

To summarize, there are no straightforward solutions to the problems Rocker aimed to solve. But to be honest, Rocker is not an elegant solution either – it just sweeps those problems under the carpet.

Thanks again and feel free to share your solutions/ideas regarding problems mentioned above.

> MOUNT – the most used and the hackiest feature. While solving the problem of package manager cache invalidation, it breaks the overall idea of immutability of docker builds and brings some nasty side-effects. [...]

The great thing about MOUNT is that it works with any package manager that keeps a local cache anywhere. I've used it with great success with yum, pear, composer, npm, and even cached 3rd-party git repos. Trying to cache all of those through any kind of HTTP proxy quickly becomes infeasible. Sometimes you can find an upstream for your package manager that's not over HTTPS and then introduce a proxy, but even if that works, you need a proxy environment variable to exist dynamically based on whether you're caching things. If you have a package manager that requires HTTPS (or is a git repo...), then you can't trivially intercept it with a proxy, and you're now dealing with custom certs and injecting custom local CAs into the OS before installing any packages – which will work great until package managers start pinning certs/CAs, and then we're back to needing MOUNT again.

I don't really expect you specifically to solve this; I'm just trying to highlight that I don't think the HTTP approach is portable enough, whereas MOUNT works with nearly anything, and combined with a little intelligence, it's possible to invalidate the build cache at exactly as reasonable a time as you would with an HTTP cache.
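For the subset of package managers that do speak plain HTTP, the wiring is at least cheap, because http_proxy and https_proxy are among Docker's predefined build args – the Dockerfile needs no ARG line, and the flag can be added or omitted per build. A small sketch (the helper name and the proxy address are illustrative, e.g. a host-side apt-cacher-ng on port 3142):

```shell
# Sketch: pass a caching proxy into a build only when caching is wanted.
# http_proxy/https_proxy are predefined build args, so the Dockerfile stays
# untouched and the flags can simply be dropped for a non-cached build.
proxy_build_args() {
  # $1: proxy URL; prints the --build-arg flags for `docker build`
  printf -- '--build-arg http_proxy=%s --build-arg https_proxy=%s' "$1" "$1"
}

# Usage (address is an assumption – adjust to your proxy):
#   docker build $(proxy_build_args http://172.17.0.1:3142) -t myimage .
```

This does nothing for the HTTPS/cert-pinning cases described above; it only shows that the "dynamic proxy environment variable" part is the easy half of the problem.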

Thanks for this extensive answer @ybogdanov !

Definitely some good pointers there, and honestly, 1/3/4/5 are really things I can live very well without.

  1. Multi-stage builds are perfectly fine for this.
  3. ARG/ENV is verbose, but does the trick.
  4. TAG/PUSH can be done outside; not a real issue.
  5. ATTACH was nice, but as you pointed out, that's already something we can do manually.

I'm still wondering about MOUNT.

I did not know about --cache-from, but its goal (from the very spartan bits of documentation I can find) looks quite different from rocker's MOUNT (or MOUNT's goals – at least mine).

Pretty much my use case was mounting package manager cache directories (/var/lib/apt, ~/.cache).

The usual argument for not allowing such a thing is indeed "we want immutable / reproducible builds", but that is already broken: we can RUN commands in docker containers that have random side effects – for example, a random command, or more realistically, a command that downloads from a package repository whose files may have been altered from one build to another (think of the evil "wget some-script | sudo bash" anti-pattern). So yes, using a local cache is risky in terms of reproducibility, but this risk already exists in Dockerfiles.

Note that I'm not complaining here – I just really think the MOUNT command was the killer feature of rocker, and I'm still looking for a good, language-agnostic option to re-use a cache dir. I don't think trusting the local cache of a well-maintained package manager is any riskier than trusting the package manager's repository to have untampered packages. Or maybe I'm missing something here? Because even if there is the risk of a malicious local user (a.k.a. me or you) hacking into the cache filesystem to include some malicious code, that's already something they can do without going to such trouble. And if you don't trust yourself (or your employees) on this, you can always have a trusted system (something-integration-ish) be the only one able to push images to the registry...

Also, I hear your point about huge Rockerfiles that are hard to maintain; I feel that this is true both for Rockerfiles and Dockerfiles.

I guess my answer is not that useful here, but I wanted you to know that this feature was, indeed, hugely appreciated.

And if someone has a good solution for this (yes, I thought about proxies, but as @neerolyte stated, that's language/package-manager specific and quite a pain to set up, with the same drawbacks), let me know!

Cheers, keep up the good work!

Thanks for your comments, @neerolyte and @hartym. Your points are absolutely correct, and I understand and share your struggle.

I have been thinking in the background about the MOUNT issue over the last couple of days; there should be alternative ways to keep a cache across invalidated builds.

Here is one of the approaches that came to my mind today, I called it "Interstellar":

  1. COPY the [initially empty] cache directory into the image
  2. RUN your build, which side-effects some cache, say into the /root/.cache directory
  3. LABEL the layer so we can find it later
  4. Do other stuff in your Dockerfile
  5. In the wrapping bash script, after the build, find that LABEL'ed layer and run a container from it, mounting the current directory to download the /root/.cache contents to a local directory
  6. On the next build, the COPY from step 1 will deliver the cache from the previous build

I've pushed the PoC to the interstellar branch, see build.sh and Dockerfile. For convenience, I'll paste the source of the two files below.

Please let me know whether this approach works for you.


build.sh:

#!/bin/bash
#
# "Interstellar" – PoC of a MOUNT alternative with vanilla Docker
#
# This Dockerfile + build.sh pair illustrates how to keep package manager and
# build system caches between build executions, even after the RUN layer has
# been invalidated. This trick helps you survive without Rocker's MOUNT.
#
# See the discussion https://github.com/grammarly/rocker/issues/199
set -e

# Make sure the directory exists (and is empty) on the first run
mkdir -p .cache/go-build

# Do a normal build
docker build -t grammarly/rocker:latest -f Dockerfile .

# Use the label trick to get the ID of the latest layer containing the Go compiler's cache
CACHE_LAYER=$(docker images --filter "label=rocker_build_cachepoint=true" --format "{{.ID}}" | head -n 1)

# The next commands overwrite any older cache; we may improve this by using `rsync` instead of `cp`
echo "Downloading the latest build cache..."
rm -rf .cache/go-build

# Store locally the cache left over from the latest build
docker run --rm -v "$(pwd)/.cache:/parent_cache" "$CACHE_LAYER" \
    /bin/bash -c 'cp -R /root/.cache/go-build /parent_cache/go-build'

Dockerfile:

FROM golang:latest as builder

COPY . /go/src/github.com/grammarly/rocker
WORKDIR /go/src/github.com/grammarly/rocker

# Note that on the first "cold" run the local directory ".cache/go-build" is empty
COPY .cache/go-build /root/.cache/go-build
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go install -v

# We use the trick with LABEL to be able to find this layer in build.sh
LABEL rocker_build_cachepoint=true

# Use Docker's multi-stage build feature to promote only our statically built rocker binary
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /go/bin/rocker /bin/rocker
CMD ["/bin/rocker"]

After some more time thinking about it, I concluded that the "interstellar" approach doesn't solve the problem entirely, and there are multiple problems with it:

  • The cache layer is copied each time and remains heavyweight in the tree.
  • The cache cannot be shared between multiple projects.
  • Uploading the build context takes more time, as we also need to upload the cache each time.

I don't think this approach is even close to MOUNT's robustness.


I started researching and stumbled upon https://github.com/moby/buildkit, which looks very promising. I recommend reading the introductory blog post. I didn't have a chance to examine the "MOUNT" issue in particular, but I think it is a good candidate for a Rocker replacement in general.
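Note for later readers: the BuildKit Dockerfile frontend has an (initially experimental) cache-mount feature that maps almost directly onto Rocker's MOUNT use case. A sketch, assuming a BuildKit-enabled docker build:

```dockerfile
# syntax=docker/dockerfile:1
FROM golang:latest
WORKDIR /src
COPY . .

# The cache directory persists across builds but never ends up in the image,
# so RUN-layer invalidation no longer throws the compiler cache away
RUN --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 go install -v
```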

Hi Yuriy,

Yes, it's mostly what I meant by "the future is probably buildkit-based", but to be honest, it's much more low-level plumbing than something we can really use right now – I mean, without investing a lot of time in implementation details. Future tools, including docker image build, will most likely use it as a base, and it will allow a variety of other tools to come out.

For now, I went back to regular Dockerfiles. There are nice things about doing so too, but I feel like we're missing a tool right now, especially when we build multiple variants of the same image (like someapp, someapp-with-ml-libraries, someapp-for-prod, etc.) that share a lot of build steps, which as of now forces us to duplicate large parts of our Dockerfiles. And I'm feeling the pain again of not having package manager downloads cached... I guess this step back is what is needed to take the two steps forward soon with buildkit-based awesomeness.

Still we're not quite there. :)

We use the MOUNT feature to pass the ssh authentication agent, like:

MOUNT {{ .Env.SSH_AUTH_SOCK }}:/ssh-agent
ENV SSH_AUTH_SOCK=/ssh-agent

This allows the build process to pull a private git repo. With this trick, everyone with access to the private repo can build it (and there is no need to push credentials into the Dockerfile or a similar place), and it integrates well with Jenkins.

This can be done ONLY with a real mount (which makes mounting sockets possible), and is not achievable with copy-like commands.
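Note for later readers: BuildKit eventually addressed this exact use case with an ssh mount type, which forwards the agent socket for a single RUN without storing anything in a layer. A sketch (the repository path is illustrative; requires a BuildKit-enabled docker build invoked with --ssh default):

```dockerfile
# syntax=docker/dockerfile:1
FROM alpine
RUN apk add --no-cache git openssh-client

# Trust the git host's key so the clone doesn't prompt (adjust for your host)
RUN mkdir -p ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts

# The agent socket is available only during this RUN and is never
# written into an image layer
RUN --mount=type=ssh git clone git@github.com:org/private-repo.git
```

Build with: DOCKER_BUILDKIT=1 docker build --ssh default .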

So I realised that with a fair bit of messing around you can always simulate MOUNT by running an image with -v where you'd normally mount stuff, running some scripts to install packages, and then converting that back to an image that can be used as a base in a normal Dockerfile.

# build all our deps in to it (the container runs the script and exits)
args=(
	docker run
	-v "$PWD/.cache/yum:/var/cache/yum"
	-v "$PWD:/scripts"
	--name "$name"-tmp
	"$mybaseimage"
	/scripts/install_packages.sh
)
"${args[@]}"

# convert it to an image (the container has already exited, so no stop is needed)
docker commit "$name"-tmp "$name"-tmp-image
docker rm "$name"-tmp

# do some more docker buildy stuff to it
tmpdir=.tmpbuild-$$
mkdir "$tmpdir"
(
	echo "FROM $name-tmp-image"
	echo WORKDIR /foo
	# ... whatever else you'd normally do in a Dockerfile ...
) > "$tmpdir/Dockerfile"
docker build -t "$name"-image "$tmpdir"
rm -rf -- "$tmpdir"

Ugly, but I think I can get the desired behaviour without having to rely on Rocker.

@neerolyte I was relying on a lot of build scripts like this before I found rocker (building contexts in tars that you pipe into another docker build command...). The main problem is that it's complex and hard to maintain...

MOUNT and ATTACH were always my aces in the hole for quickly and easily debugging issues during build processes, and nothing else has really replicated their ease of use yet as far as I know.

I know the above isn't really contributing much value to this ticket, but I felt I needed to mourn the loss of rocker after using it constantly for ~2 years :) Thanks for keeping it going as long as you did, guys.

I have been using gitlab-ci as a replacement for MOUNT for some time now.
Example:
For node-based projects I set up something like this https://gist.github.com/Larry1123/b9ad384b16035f4200ee4adb3736b67f and just do a COPY of the resulting node_modules and dist files in the Dockerfile.
This way I'm able to have a cache for any number of types of build tools.
The only issue I have with my current workflow using gitlab-ci is that it is much harder to debug a build without manually following the steps I have set up, as gitlab-ci does not at this time provide a good way to run jobs locally for dev.
I did not know about the new multi-stage builds in docker.

I liked using rocker a lot, but I only started using it about a year ago, when the project had already started to look unsupported, so I was trying to use only what I could not find in other well-supported tools. Multi-stage builds, along with gitlab-ci as part of my build stack, should for the most part cover what rocker did.
I wish there was one tool that covered everything better.

I hope that sharing what I use in place of MOUNT will help those it can help.

Just wanted to note, we have a similar issue around cache invalidation at SeatGeek, and built https://github.com/seatgeek/docker-build-cacher . This works pretty well for our use-case, but may not work for every MOUNT use-case. Posting it here in case it helps anyone (I landed here for other reasons).

Meanwhile, buildkit has been merged into mainline Docker as experimental functionality. That means we will probably soon see native, robust caching. I haven't tried it yet, though.

See moby/moby#37151