bazelbuild / bazel

a fast, scalable, multi-language and extensible build system

Home Page:https://bazel.build

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Stabilize unresolved symbolic links (--experimental_allow_unresolved_symlinks)

jmillikin opened this issue · comments

Description of the problem / feature request:

#9075 added support for unresolved symlinks, via the actions.declare_symlink() and actions.symlink() functions. That functionality is currently gated by --experimental_allow_unresolved_symlinks.

I've discovered that unresolved symlinks are critical for supporting hermetic Node.JS builds. Specifically, Node's use of symlinks for module resolution semantics mean that we need to be able to create symlink artifacts that reference dependencies and can be used as runfiles, but exist outside the runfiles root.

What are the current blockers to stabilizing unresolved symlinks? Is there a list of known remaining work to get them enabled by default?

cc @lberki @laszlocsomor

@laurentlb , @laszlocsomor , what do you think?

The main blockers I see is that RBE doesn't support this yet (the protocol itself does, but we haven't wired it up yet within Bazel) and that it probably fails on Windows in all sorts of interesting ways.

Neither of these are insurmountable problems and if I squinted the right way, I could even argue that they are not necessary for declaring the functionality stable, but I just can't sign up for doing them at the moment.

@lberki : Why do you suppose it fails on Windows?
Is there a test case I could try on Windows?

Try //third_party/bazel/src/test/shell/bazel:bazel_unresolved_symlink_test.

I admit I haven't tried it on Windows. There is a chance that it works since all that's needed is ActionMetadataHandler#fileArtifactValueFromArtifact() to work which in turn only needs Path#statIfFound(Symlinks.NOFOLLOW) and Path#readSymbolicLink() .

...oh, and Path#createSymbolicLink() for SymlinkAction#execute().

Thanks! The test fails here:

bazel build --experimental_allow_unresolved_symlinks //a:bsg && fail "build succeeded"

...because the build succeeds though it shouldn't. (Note: I edited the test to use c:/badsymlink instead of /badsymlink.)

The culprit is I think the culprit is that WindowsFileSystem does not create file symlinks:

if (!target.toFile().exists() || target.toFile().isDirectory()) {
WindowsFileOperations.createJunction(link.toString(), target.toString());
} else {
Files.copy(target, link);
}

@lberki : test_file_instead_of_symlink breaks on Linux if I add --spawn_strategy=standalone to the bazel calls.

This seems to be failing on macOS if paired with remote caching;

java.lang.IllegalStateException: Encountered symlink input '<some nice path here>', but all symlinks should have been resolved by SkyFrame. This is a bug.
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTreeBuilder.fromActionInputs(DirectoryTreeBuilder.java:106)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTreeBuilder.fromActionInputs(DirectoryTreeBuilder.java:47)
	at com.google.devtools.build.lib.remote.merkletree.MerkleTree.build(MerkleTree.java:123)
	at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:121)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:94)
	at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:39)
	at com.google.devtools.build.lib.exec.ProxySpawnActionContext.beginExecution(ProxySpawnActionContext.java:60)
	at com.google.devtools.build.lib.rules.cpp.CppCompileAction.beginExecution(CppCompileAction.java:1381)
	at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:966)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1114)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1085)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:136)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:80)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:612)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:907)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:297)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)

AFAIK this has been experimental since Bazel 1.0.

Is progress likely?

Any updates?

I hit this recently as well. I was using the feature to break circular deps between npm packages (which sadly exist in some 3rd party packages). Confirmed that it does not work with remote execution.

I also hit a bug in bazel itself related to this where the --experimental_allow_unresolved_symlinks configuration setting is not propagated to the host and exec configs so when building a target under either of those configs that calls actions.create_symlink, bazel errors out with

.

Looking at the history of this issue, the main blockers seem to be Windows and RBE support. Once those are fixed (or we decide that they don't matter), we can move this feature out of the experimental domain.

@coeuvre : What do you think about updating our RBE support to support unresolved symbolic links? I realize you are busy so I was thinking about soliciting pull requests from the community.

@meteorcloudy : given that symlinks are not a very common thing on Windows, can we get away without supporting unresolved symlinks? I don't even know if NTFS supports dangling symlinks; if it doesn't, then the decision is easy because then unresolved symlinks don't make all that much sense, do they?

@lberki My guess is that it will work correctly [--windows_enable_symlinks](with https://bazel.build/reference/command-line-reference#flag--windows_enable_symlinks), NTFS does support dangling symlinks.

Is there a simple test that I can run to verify?

test_smoke in bazel_symlink_test?

This test passes on Windows
https://buildkite.com/bazel/bazel-bazel/builds/19324#75eafab1-c5ae-4830-add5-fbe84692c31e

//src/test/shell/bazel:bazel_symlink_test                                PASSED in 29.4s

Oh, it passes because it's skipped on windows..

@lberki I am happy to review PRs :)

cc @tjgq

@lberki
test_smoke passed on Windows with --windows_enable_symlinks after fixing the path style for Windows (/nonexistent => C:/nonexistent)

** test_smoke ******************************************************************
$TEST_TMPDIR defined: output root default is 'c:\b\erp625is\execroot\io_bazel\_tmp\f8978fe5636e0739a65f37d28c69faa9' and max_idle_secs default is '15'.
Loading: 
Loading: 0 packages loaded
Analyzing: target //a:a (1 packages loaded, 0 targets configured)
INFO: Analyzed target //a:a (4 packages loaded, 6 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
Target //a:a up-to-date:
  bazel-bin/a/a
INFO: Elapsed time: 0.597s, Critical Path: 0.01s
INFO: 2 processes: 2 internal.
INFO: Build completed successfully, 2 total actions
INFO: Build completed successfully, 2 total actions
total 0
lrwxrwxrwx 1 b None 14 Apr 20 09:21 a -> /c/nonexistent
INFO[bazel_symlink_test 2022-04-20 09:21:04 (+0000)] Cleaning up workspace

�[0;32mPASSED�[0m: test_smoke

I believe Windows isn't really a blocker, dangling symlinks and dangling junctions both work.

@meteorcloudy then that leaves RBE support, I guess.

(plus some glue logic to hard-wire the value of this flag to false within Google, but that, I can throw in since contributions don't work for that one)

Is Alessandro's comment about remote caching on MacOS still relevant?

#10298 (comment)

While working on resolving the various remaining issues around unresolved symlinks in Bazel, I came across a subtle decision we will have to make: When an action is executed with a local symlink-based sandboxed strategy (e.g. linux-sandbox or darwin-sandbox), there are two fundamentally different ways of staging an unresolved, relative symlink:

  1. Stage it as a symlink pointing to the real relative symlink in the unsandboxed exec root. This mirrors the behavior for any other kind of artifact.
  2. Stage the symlink directly as a relative symlink pointing to some other file in the sandboxed exec root. This is the current behavior.

This point is certainly controversial because rules_js would like the behavior to be 1 (@gregmagolan), whereas I could see rules_pkg users wanting the behavior to be 2 (@aiuto). Am I correct?

Also CC @alexjski @larsrc-google who I've noticed are interested in this based on comments on other PRs.

@fmeum I tested your commit that stages it as a symlink pointing to the real symlink (option 1 above), fmeum@8e9fa6a. This fixes the sandbox issues with symlinks in rules_js. In my opinion this is the principled approach as makes all artifacts in the sandbox consistent and retains all information. I.e., you can follow the symlink out of the sandbox to the real symlink to find out where that one points to.

@fmeum I tested your commit that stages it as a symlink pointing to the real symlink (option 1 above), fmeum@8e9fa6a. This fixes the sandbox issues with symlinks in rules_js. In my opinion this is the principled approach as makes all artifacts in the sandbox consistent and retains all information. I.e., you can follow the symlink out of the sandbox to the real symlink to find out where that one points to.

Wouldn't that mean that such a symlink would allow sandboxed action access undeclared inputs? Imagine the unresolved symlink points to a path of an output artifact we do not depend on. If we use strategy 2, the behavior would always be the same, as in we would get an unresolved symlink. In case of 1, depending on action execution order, could we sometimes get an actual file?

@alexjski Can you give an example? You'd only get a symlink to a symlink in the sandbox if you have the symlink outside the sandbox as an input.

In either approach you can call realpath and find out the actual file the symlink is pointing to if that is desired. If you want to retain the information of where the symlink outside of the sandbox is pointing to, as rules_js does, I believe you can only do that with approach 1.

@alexjski Can you give an example? You'd only get a symlink to a symlink in the sandbox if you have the symlink outside the sandbox as an input.

In either approach you can call realpath and find out the actual file the symlink is pointing to if that is desired. If you want to retain the information of where the symlink outside of the sandbox is pointing to, as rules_js does, I believe you can only do that with approach 1.

I just tested that and it seems to me like we have strategy 1, at least to some extent. That makes me think that I am missing some crucial context here and the best bet would be to see what @larsrc-google thinks about that. Anyway, let me share the example at least to show the problem and the kind of scenario I had in mind:

$ cat def.bzl 
def _r(ctx):
  symlink = ctx.actions.declare_symlink(ctx.label.name + "_s")
  ctx.actions.symlink(output=symlink, target_path=ctx.file.file.path)

  output = ctx.actions.declare_file(ctx.label.name)
  ctx.actions.run_shell(
    command = "{ cat %s || true; } >  %s" % (symlink.path, output.path),
    inputs = [symlink],
    outputs = [output],
  )
  return DefaultInfo(files=depset([output]))

r = rule(implementation=_r, attrs = {'file': attr.label(allow_single_file=True)})
$ cat BUILD 
load("def.bzl", "r")

genrule(name = "a", outs = ["file"], cmd = "echo hello >$@")
r(name="b", file="file")

Now:

$ bazel clean && bazel build :b --experimental_allow_unresolved_symlinks --spawn_strategy=sandboxed && echo "Result: '$(<bazel-bin/b)'"
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
INFO: Analyzed target //:b (5 packages loaded, 9 targets configured).
INFO: Found 1 target...
INFO: From Action b:
cat: bazel-out/k8-fastbuild/bin/b_s: No such file or directory
Target //:b up-to-date:
  bazel-bin/b
INFO: Elapsed time: 0.215s, Critical Path: 0.02s
INFO: 3 processes: 2 internal, 1 linux-sandbox.
INFO: Build completed successfully, 3 total actions
Result: ''

Let's try the same after build :a:

$ bazel clean && bazel build :a --spawn_strategy=sandboxed && bazel build :b --experimental_allow_unresolved_symlinks --spawn_strategy=sandboxed && echo "Result: '$(<bazel-bin/b)'"
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
INFO: Analyzed target //:a (5 packages loaded, 7 targets configured).
INFO: Found 1 target...
Target //:a up-to-date:
  bazel-bin/file
INFO: Elapsed time: 0.208s, Critical Path: 0.01s
INFO: 2 processes: 1 internal, 1 linux-sandbox.
INFO: Build completed successfully, 2 total actions
INFO: Build option --experimental_allow_unresolved_symlinks has changed, discarding analysis cache.
INFO: Analyzed target //:b (0 packages loaded, 9 targets configured).
INFO: Found 1 target...
Target //:b up-to-date:
  bazel-bin/b
INFO: Elapsed time: 0.128s, Critical Path: 0.01s
INFO: 3 processes: 2 internal, 1 linux-sandbox.
INFO: Build completed successfully, 3 total actions
Result: 'hello'

This is only my personal opinion, but such interaction seems like an undesirable property in terms of determinism of build results (imagine bazel build :a :b where race between a and b actions decides the result). This is result of strategy 1. If we planted the symlinks relative to sandbox execroot, the symlink would always point to a non-existent file.

I see what you mean @alexjski . That makes sense and symlinks that pointed relatively within the sandbox would work for rules_js.

When I was testing @fmeum's WIP branch, the issue I ran into was in a runfiles tree in the sandbox which seems like a special case not handled properly yet. There was a declare_symlink input in the runfiles which I believes suffers from the same problem you just described where the runfiles tree is a symlink to a symlink outside of the runfiles tree. The sandbox version of that symlink was pointing outside of the sandbox and outside of the runfiles tree to the realpath.

Outside of the sandbox in the runfiles tree:

bazel-out/darwin-opt-exec-2B5CBBC6/bin/examples/npm_deps/bin1.sh.runfiles/aspect_rules_js/examples/npm_deps/node_modules/@gregmagolan/test-b ->
bazel-out/darwin-opt-exec-2B5CBBC6/bin/examples/npm_deps/node_modules/@gregmagolan/test-b

In the sandbox,

bazel-out/darwin-opt-exec-2B5CBBC6/bin/examples/npm_deps/bin1.sh.runfiles/aspect_rules_js/examples/npm_deps/node_modules/@gregmagolan/test-b ->
../../../../node_modules/.aspect_rules_js/@gregmagolan+test-b@0.0.2/node_modules/@gregmagolan/test-b

If we could maintain relative symlinks within the runfiles trees and within the sandbox then that would solve the problem I hit as well.

I'm glad to learn that the current approach can work for rules_js if we fix the runfiles tree issue - I was also worried about the implications the alternative would have had on hermiticity.

I will look into fixing the runfiles tree issue. @alexjski Would you be available as a reviewer for a fix for that?

I put up a PR to make using declare_symlink conditional on whether or not --experimental_allow_unresolved_symlinks is set, aspect-build/rules_js#283. That should be all that is needed in rules_js to use unresolved symlinks so it will be ready for use there once all of your work lands @fmeum .

We've had some discussions about symlinks internally, and there are many more corner cases than I realized. I've started putting together the information we have, so we can avoid making more changes we later have to undo.

We've had some discussions about symlinks internally, and there are many more corner cases than I realized. I've started putting together the information we have, so we can avoid making more changes we later have to undo.

Can you share this information publicly, possibly already while it is being compiled? I have already identified a number of these corner cases and would like to have a central location to collect and discuss them.

Changing to P1 since this is blocking the cut of Bazel 6.0 release.
FYI @lberki @larsrc-google

@gregmagolan Could you confirm whether remote execution support for actions that produce unresolved symlinks is a blocker for rules_nodejs? (This doesn't currently work in Bazel, even after @fmeum's changes; consuming unresolved symlinks remotely, on the other hand, does work.)

@gregmagolan Could you confirm whether remote execution support for actions that produce unresolved symlinks is a blocker for rules_nodejs? (This doesn't currently work in Bazel, even after @fmeum's changes; consuming unresolved symlinks remotely, on the other hand, does work.)

@tjgq Thanks for checking. Actions that output unresolved symlinks running in RBE are not needed. Only that RBE can handle unresolved symlink inputs.

Does anyone have a simple-to-reproduce case where dangling symlinks work with standalone execution but not with sandboxed execution? The cases I thought we have actually fail without --experimental_allow_unresolved_symlinks, so they're not useful for this.

I have resolved a major bug in my understanding of the symlink handling, and believe that with #15982 there are no more blockers in the way symlinks are handled for sandboxes.

@larsrc-google do I understand correctly that the last blocker was #16290 which just got fixed?

No more blockers from my side. @tjgq is tracking the RBE-related bits.

...and @meisterT says that it's all good on the remote execution side, so then I assume all bits are ready, except for flag administration (we don't want this to be enabled within Google)

Can someone clarify what else action items are needed for this issue? If none, let's close this one to reduce the number of release blockers for 6.0.

I do believe that covers everything needed for 6.0.

@larsrc-google Aren't we missing a commit that removes the experimental prefix, enables the flag by default and disables it in Google's .bazelrc?

Arg. Yes, you're right. Starting on that.

@meisterT what about the rename?

See my comment here: #16458 (comment)