shadow / shadow

Shadow is a discrete-event network simulator that directly executes real application code, enabling you to simulate distributed systems with thousands of network-connected processes in realistic and scalable private network experiments using your laptop, desktop, or server running Linux.

Home Page: https://shadow.github.io


Implement vfork syscall

sporksmith opened this issue · comments

Part of #1987

The vfork syscall (or clone or clone3 with the CLONE_VFORK flag) is a way of saving some overhead when spawning a new process. Unlike fork, the child process shares memory with the parent (hence saving the overhead of copying page tables to make memory copy-on-write in the child). The parent process is suspended until the child process exits or execs.

Importance

Use-cases verified to not use vfork:

  • arti uses std::process::Command to spawn pluggable transport processes, which currently uses fork.

  • More generally, vfork will probably not be used much in Rust until rust-lang/libc#1596 is fixed.

  • tor also uses fork, not vfork, to spawn processes.

  • Using strace on a simple bash script on my machine shows it using fork-like clone invocations (not vfork).

Use-cases that do use vfork:

  • The posix_spawn libc function is documented as being implemented in glibc with vfork (or an equivalent clone call using CLONE_VFORK). https://www.man7.org/linux/man-pages/man3/posix_spawn.3.html
  • python3's subprocess module
  • dash (which is what /bin/sh resolves to on many systems)
  • Rust's std::process::Command::spawn. It's also unusual in that it uses clone3 with CLONE_VFORK and a new stack; e.g. from strace: clone3({flags=CLONE_VM|CLONE_VFORK, exit_signal=SIGCHLD, stack=0x7efc570b3000, stack_size=0x9000}, 88)
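To check a given spawn path yourself, a small test program helps. The following is a minimal sketch, not part of Shadow: the helper name spawn_and_wait is hypothetical, and it assumes Python 3.8+ on Linux, where os.posix_spawn wraps the libc posix_spawn function. Running it under strace -f should show which fork-like syscall your libc actually issues; the result depends on the libc and Python versions.

```python
import os
import sys


def spawn_and_wait(argv):
    """Spawn argv[0] via os.posix_spawn and return the child's exit code.

    Hypothetical helper for experimentation; whether the underlying libc
    call uses vfork, clone(CLONE_VFORK), or fork depends on the libc.
    """
    pid = os.posix_spawn(argv[0], argv, dict(os.environ))
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    # Child was terminated by a signal: report it as a negative number.
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    code = spawn_and_wait([sys.executable, "-c", "pass"])
    print("child exit code:", code)
```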

Feasibility

Implementing the shim-side code for vfork is tricky. Unlike spawning a new thread, the child process continues running on the same stack. Unlike fork, modifications to that stack are seen in the parent as well. Therefore we can't return from our syscall handling functions, since this would corrupt the stack in the parent. We also can't long jump to the point where the syscall was made (as we do when spawning a new thread) and have the parent return normally, since this would also corrupt the stack in the parent.

We might be able to return normally in the child process, and later long-jump in the parent process when it gets to run again. This seems pretty tricky, though.
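The stack-sharing problem can be illustrated with a toy model. This is only a sketch: the stack is modelled as a Python list of frame labels, and the names run_child and top_frame_seen_by_parent are made up for illustration. It shows what goes wrong if the child returns from the syscall handler while sharing the parent's stack, and why a copied (fork-style) stack does not have the problem.

```python
def run_child(stack):
    """Child returns from the syscall handler, then calls another function."""
    stack.pop()                           # returning releases the handler frame
    stack.append("child: helper frame")   # the next call reuses the same slot


def top_frame_seen_by_parent(shared):
    """Return the frame the parent finds on top of its stack when it resumes."""
    parent_stack = ["main frame", "parent: syscall handler frame"]
    # vfork-like: child uses the very same stack; fork-like: child gets a copy.
    child_stack = parent_stack if shared else list(parent_stack)
    run_child(child_stack)
    return parent_stack[-1]


if __name__ == "__main__":
    print("shared stack:", top_frame_seen_by_parent(shared=True))
    print("copied stack:", top_frame_seen_by_parent(shared=False))
```

With a shared stack, the parent resumes and finds the child's frame where its own handler frame used to be; with a copied stack, its frame is intact.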

One possibility is to just treat vfork exactly like fork (and treat the CLONE_VFORK flag as a no-op). In principle this would break code that relies on implementation details of vfork under Linux, e.g. by intentionally writing to parent memory from the child, but relying on such implementation details is already fragile and non-portable. For example, the vfork man page states that POSIX.1 specifies that

"behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions."

Found some more examples that use vfork rather than fork: python3's subprocess module, and the dash shell.

So far all of these bail out when vfork fails, instead of falling back to fork.

Confirmed that at least for python3, just treating the vfork as fork works. It's probably more useful to do that and log a warning than to return an error.
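The "treat it as fork and log a warning" idea can be sketched as a flag rewrite. This is an illustrative sketch only, not Shadow's actual code: the function name treat_vfork_as_fork and the logger name are hypothetical. It also assumes that CLONE_VM is dropped together with CLONE_VFORK, so that the child gets its own copy-on-write address space as with fork; the discussion above only mentions CLONE_VFORK. The flag values are the Linux ones from <linux/sched.h>.

```python
import logging

# Linux clone flag values from <linux/sched.h>.
CLONE_VM = 0x00000100
CLONE_VFORK = 0x00004000

log = logging.getLogger("vfork_sketch")  # hypothetical logger name


def treat_vfork_as_fork(flags):
    """Return clone flags with vfork semantics removed (hypothetical helper).

    Flags without CLONE_VFORK are returned unchanged. Dropping CLONE_VM
    as well is an assumption made for this sketch.
    """
    if not flags & CLONE_VFORK:
        return flags
    log.warning("CLONE_VFORK requested; emulating it as a plain fork")
    return flags & ~(CLONE_VFORK | CLONE_VM)
```

For a legacy clone call, where the exit signal lives in the low byte of the flags, treat_vfork_as_fork(CLONE_VM | CLONE_VFORK | 17) leaves just the exit signal (17, SIGCHLD on x86-64), which is what a fork-like clone looks like.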

std::process::Command::spawn and posix_spawn also provide a non-null stack in their CLONE_VFORK operations. This isn't implemented yet, but should be straightforward to add.
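For reference, here is how the child's starting stack pointer follows from the clone3 arguments. With clone3, stack points to the lowest byte of the stack area and stack_size gives its length, so on architectures where the stack grows downward the child starts at stack + stack_size (the legacy clone call instead takes the top of the stack directly). A small worked sketch, using the values from the strace output quoted earlier; the function name is hypothetical:

```python
def clone3_initial_sp(stack, stack_size):
    """Initial child stack pointer for clone3 args, or None if no new stack.

    Assumes a downward-growing stack (e.g. x86-64): the child starts at
    the highest address of the [stack, stack + stack_size) region.
    """
    if stack == 0:
        return None  # child keeps using the parent's stack pointer
    return stack + stack_size


if __name__ == "__main__":
    sp = clone3_initial_sp(0x7EFC570B3000, 0x9000)
    print(hex(sp))  # 0x7efc570bc000
```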

Confirmed all the above usages so far work with vfork treated as fork + supporting the stack switch for this case. Will have a PR out in a bit.

Closing.

Our implementation, which treats vfork as fork, is compliant with the POSIX spec, which states:

(From POSIX.1) The vfork() function has the same effect as fork(2), except that the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions.

It also appears, so far, to be compatible with the several high-level process-spawning APIs we've tried that use vfork.

If we end up needing to more precisely emulate Linux semantics it'll make sense to open a new issue.