Implement vfork syscall
sporksmith opened this issue
Part of #1987
The vfork syscall (or clone or clone3 with the CLONE_VFORK flag) is a way of saving some overhead when spawning a new process. Unlike fork, the child process shares memory with the parent (hence saving the overhead of copying page tables to make memory copy-on-write in the child). The parent process is suspended until the child process exits or execs.
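For contrast, here is a minimal sketch of the fork side of that comparison, assuming a Unix-like system. Python's os module has no direct vfork wrapper, so only the fork half is shown; the function name is made up for illustration. With fork, a write made by the child lands in the child's private copy of memory, which is exactly what vfork avoids setting up.

```python
import os


def child_write_visible_to_parent() -> bool:
    """Fork, mutate a variable in the child, and report what the parent sees."""
    value = [0]
    pid = os.fork()
    if pid == 0:
        # Child: this write goes to the child's private (copy-on-write) memory.
        value[0] = 1
        os._exit(0)  # exit immediately, skipping any parent cleanup handlers
    os.waitpid(pid, 0)  # parent: wait for the child to finish
    return value[0] == 1
```

Under fork the parent does not observe the child's write. Under a real vfork the two processes share memory, so a write like this would be visible to the parent (and is undefined behavior per POSIX).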
Importance
Use-cases verified to not use vfork:
- arti uses std::process::Command to spawn pluggable transport processes, which currently uses fork.
- More generally, vfork will probably not be used much in Rust until rust-lang/libc#1596 is fixed.
- tor also uses fork, not vfork, to spawn processes.
- Using strace on a simple bash script on my machine shows it using fork-like clone invocations (not vfork).
Use-cases that do use vfork:
- The posix_spawn libc function is specified as using vfork. https://www.man7.org/linux/man-pages/man3/posix_spawn.3.html
- python3's subprocess module
- dash (which is what /bin/sh resolves to on many systems)
- Rust's std::process::Command::spawn. It's also unusual in that it uses clone with vfork and a new stack; e.g. from strace: clone3({flags=CLONE_VM|CLONE_VFORK, exit_signal=SIGCHLD, stack=0x7efc570b3000, stack_size=0x9000}, 88)
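To make the flags in that strace line concrete, here is a small sketch that decodes them. The two constant values are the ones from the Linux UAPI header linux/sched.h; the helper names are hypothetical and only cover the two flags seen in the strace output.

```python
# Flag values as defined in the Linux UAPI header <linux/sched.h>.
CLONE_VM = 0x00000100     # child shares the parent's address space
CLONE_VFORK = 0x00004000  # parent is suspended until the child exits or execs


def describe_clone_flags(flags: int) -> list:
    """Return the names of the flags (out of the two above) set in `flags`."""
    names = []
    if flags & CLONE_VM:
        names.append("CLONE_VM")
    if flags & CLONE_VFORK:
        names.append("CLONE_VFORK")
    return names


def is_vfork_like(flags: int) -> bool:
    """True if the caller asked for vfork-style suspension of the parent."""
    return bool(flags & CLONE_VFORK)
```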
Feasibility
Implementing the shim-side code for vfork is tricky. Unlike spawning a new thread, the child process continues running on the same stack. Unlike fork, modifications to that stack are seen in the parent as well. Therefore we can't return from our syscall handling functions, since this would corrupt the stack in the parent. We also can't long jump to the point where the syscall was made (as we do when spawning a new thread) and have the parent return normally, since this would also corrupt the stack in the parent.
We might be able to return normally in the child process, and later long-jump in the parent process when it gets to run again. This seems pretty tricky, though.
One possibility is to just treat vfork exactly like fork (and treat the CLONE_VFORK flag as a no-op). In principle this would break code that relies on implementation details of vfork under Linux, e.g. by intentionally writing to parent memory from the child, but relying on such implementation details is already pretty fragile and non-portable. For example, the vfork man page states that POSIX.1 specifies that "behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions."
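As a rough sketch of what treating vfork like fork could mean at the flag level (not the actual implementation): the function name is hypothetical, and it assumes that CLONE_VM is dropped together with CLONE_VFORK so that the child gets its own copy-on-write address space. The issue itself only says CLONE_VFORK would be a no-op.

```python
# Flag values from the Linux UAPI header <linux/sched.h>.
CLONE_VM = 0x00000100
CLONE_VFORK = 0x00004000


def emulate_vfork_as_fork(flags: int) -> int:
    """Rewrite clone flags so a vfork-style request behaves like fork.

    ASSUMPTION (not stated in the issue): alongside ignoring CLONE_VFORK,
    CLONE_VM is cleared too, so the child no longer shares the parent's
    memory. Requests without CLONE_VFORK are passed through unchanged.
    """
    if flags & CLONE_VFORK:
        flags &= ~(CLONE_VFORK | CLONE_VM)
    return flags
```

With this rewrite, a thread-like request that sets CLONE_VM but not CLONE_VFORK is left alone, so only vfork-style callers are affected.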
Found some more examples that use vfork rather than fork: python3's subprocess module, and the dash shell.
So far all of these bail out instead of falling back to fork.
Confirmed that at least for python3, just treating the vfork as fork works. It's probably more useful to do that and log a warning than to return an error.
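A check along these lines can be reproduced with a small script like the one below. This is a sketch, not the test that was actually run. Whether subprocess takes a vfork path internally depends on the CPython version, platform, and arguments, so treat that part as an assumption.

```python
import subprocess
import sys


def spawn_and_capture() -> str:
    """Spawn a child interpreter through subprocess and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", "print('spawned ok')"],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

If the spawn path is broken, this fails loudly (an exception from subprocess.run) rather than silently producing no output.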
std::process::Command::spawn and posix_spawn also provide a non-null stack in their CLONE_VFORK operations. Handling this isn't implemented yet, but it should be straightforward to add.
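For reference, here is the arithmetic a stack switch would need, as a sketch. Per clone(2), clone3's stack field is the lowest address of the stack area and stack_size is its length, so on architectures where the stack grows down the child starts at the top of that area. The function name is hypothetical.

```python
def initial_child_stack_pointer(stack: int, stack_size: int) -> int:
    """Top of the stack area described by clone3's `stack` and `stack_size`.

    Assumes a downward-growing stack (as on x86-64), so the child's first
    stack pointer is the highest address of the area.
    """
    return stack + stack_size


# Values taken from the strace output earlier in this issue.
child_sp = initial_child_stack_pointer(0x7EFC570B3000, 0x9000)
```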
Confirmed all the above usages so far work with vfork treated as fork, plus support for the stack switch in this case. Will have a PR out in a bit.
Closing.
Our implementation of treating vfork as fork is compliant with the POSIX spec, which states:
(From POSIX.1) The vfork() function has the same effect as fork(2), except that the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions.
It also appears so far to be compatible with the several high-level process-spawning APIs we've tried that use vfork.
If we end up needing to more precisely emulate Linux semantics it'll make sense to open a new issue.