buildbuddy-io / buildbuddy

BuildBuddy is an open source Bazel build event viewer, result store, remote cache, and remote build execution platform.

Home Page: https://buildbuddy.io

Remote execution uses over ~50x more cache transfers than local upstreaming

aaronmondal opened this issue

rules_ll replicates remote execution environments locally in a way that lets users build a project locally and upstream the results to a remote cache, such as the BuildBuddy remote cache. This way we can reuse artifacts on different machines without remote execution. Building LLVM like this locally causes an upload of ~3.5GB of artifacts:

c484ad18-a047-442a-bf04-ecc81166b525_raw.json.txt

Running roughly the same build (at a later LLVM commit) with a BuildBuddy remote executor still produces the same ~3.5GB cache upload, but causes ~200GB of cache download:

be9102f9-6977-47b0-8ab9-25ddb38c6265_raw.json.txt
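
For context, the two setups being compared correspond roughly to the following .bazelrc fragments. This is only a sketch built from BuildBuddy's documented endpoints and standard Bazel flags; the config names and the API key placeholder are assumptions, not the actual rules_ll configuration.

```
# Common: stream build events to BuildBuddy and authenticate (placeholder key).
build --bes_backend=grpcs://remote.buildbuddy.io
build --bes_results_url=https://app.buildbuddy.io/invocation/
build --remote_header=x-buildbuddy-api-key=<YOUR_API_KEY>

# Mode 1: build locally and upstream results to the remote cache (~3.5GB upload).
build:local-upstream --remote_cache=grpcs://remote.buildbuddy.io
build:local-upstream --remote_upload_local_results

# Mode 2: run the same actions on BuildBuddy remote executors (~200GB download observed).
build:remote-exec --remote_executor=grpcs://remote.buildbuddy.io
```

The two invocations would then look like `bazel build --config=local-upstream //...` and `bazel build --config=remote-exec //...`.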

So with the 100GB cache transfer limit on an open source account, this is the difference between ~1 build per day and ~0.5 builds per month 😅

It seems that the remote executor refetches artifacts even if it has just built them. This doesn't look like intended behavior to me. If it is working as intended because of some sandboxing policy, it might be a good idea to add a suggestion on how to relax the executor-local cache reuse policy.

Hey @aaronmondal - if a remote executor doesn't already have an input needed to perform a remote action in its local disk cache, it must fetch it from the remote cache. If the executor already has the artifact in its local cache, it will not be fetched again. There can be many (dozens, hundreds, or more) executors running at any given time, so it can take a few builds for a given cache artifact to wind up on every executor. You can explore the Executions tab and click on an execution to see its inputs and how large they are.

If much of this size comes from toolchains or other large inputs that get pulled in for every action, you can consider putting them on a custom container image, which gets pulled from a Docker registry rather than from the cache: https://www.buildbuddy.io/docs/rbe-platforms/#using-a-custom-docker-image
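
The linked docs configure this through a platform target's `exec_properties`. A minimal sketch, assuming the standard `@platforms` constraints and the `container-image` property described in the BuildBuddy docs; the platform name and image URL below are placeholders:

```python
# BUILD: execution platform telling remote executors which container image to
# run actions in. The image is pulled from a registry (and kept on each
# executor) instead of shipping toolchain inputs through the cache per action.
platform(
    name = "remote_docker_platform",
    constraint_values = [
        "@platforms//os:linux",
        "@platforms//cpu:x86_64",
    ],
    exec_properties = {
        "OSFamily": "Linux",
        # Placeholder: point this at your own registry image.
        "container-image": "docker://gcr.io/my-project/my-rbe-image:latest",
    },
)
```

It would then be selected with flags along the lines of `--extra_execution_platforms=//:remote_docker_platform --platforms=//:remote_docker_platform --host_platform=//:remote_docker_platform`.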

@siggisim Thanks for the swift reply! I might have an idea where that download size is coming from, though this might be a "bug" in the UI.


After looking at the build again, it just didn't make sense that there would be such a large download size from the Bazel artifacts:

https://app.buildbuddy.io/invocation/be9102f9-6977-47b0-8ab9-25ddb38c6265

The Executions tab here shows ~6000 actions with a max read size of ~0.28KB, which would amount to a max download of ~2MB. That looks a bit strange, but since this was a clean build it might just be profiling artifacts exchanged between executors.

The large number of CAS hits probably plays some part in this, but ~500,000 hits at a max artifact size of ~130KB still amount to at most ~65GB.


However, we already use a custom image, and that image is quite large at ~2.8GB. If the Docker pull of that image on 50 workers is tracked in the download size, that would explain the ~140GB that I can't find in the logs (though do 50 jobs mean 50 separate pulls?).

Another part is that http_archive fetches seem to be missing from the logs as well. I expect ~700MB of fetches there, which across 50 workers could again contribute to the overall download size.

So one of these, or both, might be missing metrics in the UI (or I just couldn't find them?):

  • Docker pull size
  • http_archive (and similar) fetch size

It might also be relevant that we are using bzlmod for all of this, and the log might be missing information because fetches triggered by module resolution are not tracked correctly.

Hey @aaronmondal - you can take a look at the Executions tab and sort by "File size downloaded"

[Screenshot (2023-04-18): Executions tab sorted by "File size downloaded"]

It shows that many of the 5,323 remote actions have 700MB+ of inputs (these numbers only count networked downloads, and skip artifacts that are already present on the remote executor). You can click on these individual actions and explore the input files to see where the data is going.

Docker pull size doesn't affect these stats, since images are pulled from a Docker registry rather than from the cache.

http_archive fetches don't count because they are downloaded to the machine hosting Bazel, not to the remote executors (unless an archive is listed as an input to a particular remote action).

@siggisim Ahhh, now I see it. OK, then this is all clear and of course fully explains everything.

Maybe it would be a good idea to make that small text larger and higher contrast. On a (fairly high-quality) 4K display it is so small and low-contrast that I overlooked it even after looking at these logs for a really long time. I always read the 0.28KB number, which takes all the visual focus since it is so much more pronounced, and assumed that those 0.28KB values were the download size, completely overlooking the 808MB value.

I don't have a visual impairment, just an occasional case of "being a very dumb user". Maybe this classifies as an accessibility issue regardless 😅

Totally agree that execution metadata can be easy to miss, will work on making it more readable!